Abstract

This  thesis  examines  the  discrimination  of  targets  with  Ultra  High  Range  Resolution 
(UHRR)  radar  data.  Using  these  measured  signals  from  frontal  aspect  angles  of  four  aircraft 
classes,  the  baseline  performance  of  the  Adaptive  Gaussian  Classifier  (AGC)  is  tested  with 
respect to aligning exemplars to templates. Alignment plays a crucial role in the AGC’s
classification performance, which can degrade by 11% for a target class. The AGC is compared
to  non-parametric  classifiers,  but  no  statistically  significant  degradation  of  performance  is 
found.  Data  separability  is  analyzed  by  bounding  the  Bayes  error.  The  data  is  well  separated 
in a statistical sense. A feature selection algorithm, based on analysis of the decision boundary,
is applied to find a reduced feature set whose members are linear combinations of the original features.
These  features  are  optimized  with  respect  to  classification  error  rather  than  reconstruction 
error.  This  technique  is  extended  to  deduce  the  relevant  features  in  the  original  feature  space. 
Fewer  than  5%  of  the  features  in  the  original  feature  space  may  be  used  to  attain  an  improved 
classification  rate.  This  new  method  is  a  true  reduction  of  features  and  shows  improvement 
up  to  15%.  Discrimination  of  UHRR  radar  signatures  using  a  multiresolution  analysis  is 
proposed.  The  decision  boundary  analysis  chooses  relevant  wavelet  scales  with  respect  to 
classification.  Some  improved  performance  against  an  entropy  based  measure  is  observed  for 
limited  feature  sets.  The  technique  developed  here  successfully  chooses  the  scale  that  causes 
classification  performance  to  peak  within  5%  of  the  performance  in  the  full-dimensional  or 
reduced-dimensional  UHRR  radar  signature  space. 

Classification of Ultra High Range Resolution Radar Using Decision Boundary Analysis


I.  Introduction

1.1  Background 

Non-Cooperative  Target  Identification  (NCTI)  is  a  top-priority  area  of  research  for  the 
Air  Force.  NCTI  is  an  Automatic  Target  Recognition  (ATR)  approach  that  involves  the  ability 
to  identify  targets  without  active  transmitters  or  transponders  on  the  target.  ATR  enables 
a  searcher  to  distinguish  between  friendly  (blue)  and  other  forces  (red/gray).  In  the  case  of 
identifying  blue  systems,  active  devices,  such  as  Identification,  Friend  or  Foe  (IFF)  transmitters 
are  used,  but  they  cost  money  to  develop  and  add  a  point  of  vulnerability  on  the  battlefield.  A 
reliable  NCTI  capability  holds  the  promise  of  less  risk  of  intercept  of  tell-tale  electromagnetic 
(EM) signals on the battlefield. Also, the need for the design and procurement of such IFF
devices  is  eliminated.  One  common,  non-cooperative  source  of  information  about  a  target 
lies  embedded  in  its  radar  signature.  Reflected  EM  waves  are  a  readily  accessible  source  of 
information  unique  to  the  geometry  and  composition  of  a  target.  The  ongoing  challenge  is  to 
find  ways  to  extract  and  use  this  information  effectively. 

Ultra High Range Resolution (UHRR) radar is a type of radar that can be used for
NCTI  (34,  44, 1).  As  in  all  radar  processing  techniques,  total  energy  returned  from  a  target  is 
distributed  into  “range  bins.”  Range  bins  contain  the  energy  reflected  from  target  structures  at 
incremental  distances  from  the  radar  source.  Using  UHRR  radar  signatures  for  NCTI  is  based 
on  the  premise  that  signatures  contain  unique  information  about  a  target  class.  This  premise 
is  intuitively  attractive  for  ATR  because  one  may  use  this  unique  information  for  recognition 
if  one  can  discover  where  the  discriminantly  relevant  information  lies. 

Statistical pattern recognition uses data extracted from targets, such as UHRR radar
signatures, to classify them. Tools include parametric (Gaussian classifiers) and non-parametric
(k-nearest neighbor and neural network) systems. A Gaussian classifier is currently being
pursued in the UHRR radar problem at WL/AARA. This approach makes important assumptions
about the statistics of the data. Fukunaga and Martin applied non-parametric density
estimations to computer generated UHRR radar signatures, making predictions about the underlying
information content in a signature (20, 33). Earlier Air Force Institute of Technology students,
Dewitt and Kouba, used Hidden Markov Models and recurrent neural networks to classify
targets (26, 15). Hughes Aircraft developed an Adaptive Gaussian Classifier (AGC) which is
currently being exercised at Wright Laboratories. In the general case, the AGC is adaptive
in the sense that mean and variance computations are adjusted to compensate for amplitude
changes within the incoming signals (34:12).

The  critical  problem  in  pattern  recognition  is  how  to  extract  data  (or  features)  from 
targets  efficiently.  “Efficiently”  has  a  dual  meaning.  First,  it  means  transforming  or  projecting 
the raw data into a convenient set of parameters (i.e., features) representing that object. The
parameters  must  be  energy  normalized  and  properly  aligned  for  comparison.  “Efficiently” 
also  means  using  only  those  features  that  are  relevant  to  distinguishing  classes  of  targets 
for  the  problem  at  hand.  Every  feature  included  in  an  analysis  has  a  computational  cost 
associated  with  it.  Also,  it  is  possible  to  include  too  many  features,  saturating  the  capability 
of  the  classifier  for  the  amount  of  training  data  available  (23:204).  A  priori  knowledge  of  the 
problem  may  be  used  to  help  choose  effective,  discriminating  features. 

1.2  Problem  Statement 

This thesis investigates raw UHRR radar signatures as feature vectors in the
classification of aircraft. The sensitivity of classification error to proper alignment, radar “flashes,”
and choice of classifier is studied. An important, related problem is identifying those features
which contain information relevant to successful classification. A technique for doing so is
applied and modified to choose an efficient set of features in the raw data and in a multiresolution
wavelet decomposition of the data.

1.3  Scope 

This  thesis  analyzes  the  ability  of  the  AGC  to  properly  align  and  classify  efficient 
features  in  the  context  of  UHRR  radar  data.  The  situation  analyzed  is  a  four  class  problem  with 
approximately  1000  signatures  of  measured  data  available  for  each  class.  The  current  AGC  will 
be  baselined  by  controlling  which  signatures  the  AGC  will  train  on  and  classify.  This  test  will 
show  some  signatures  contain  radar  “flashes”  that  cause  the  AGC  to  improperly  align  signatures 
with  the  correct  class  template.  The  utility  of  a  Gaussian  classifier  in  this  pattern  recognition 
problem  is  tested  by  comparing  its  performance  to  non-parametric  classifiers  with  the  same 
data  sets.  The  information  content  of  the  radar  returns  is  explored  by  bounding  the  Bayes 
error.  Finally,  discriminant  analysis  is  applied  to  the  problem  using  the  decision  boundary 
generated  by  the  quadratic  Gaussian  classifier.  The  goal  is  to  determine  the  discriminantly 
relevant portions of the signatures. In a two class situation, it is shown that specific frequency
bands, generated with a wavelet decomposition, contain most of the pertinent information for
classification. 

1.4  Approach 

This thesis begins by baselining the Hughes AGC as developed for the NCTI program at
Wright  Laboratories.  Classification  rates  are  estimated  by  training  and  testing  using  the  holdout 
method  and  repeating  over  many  independent  trials  (19:220-221).  The  impact  of  feature 
vectors  which  are  not  properly  aligned  by  the  AGC  is  analyzed  by  withholding  signatures 
that  are  not  properly  aligned  from  the  classification  process  and  seeing  whether  classification 
improves.  The  effects  of  alignment  are  analyzed  by  hand  aligning  the  problem  signatures  and 
then  using  the  AGC.  The  classification  rates  for  each  case  are  compared  to  see  which  effects 
have a significant impact on the classification rate. A k-Nearest Neighbors (k-NN) classifier
and a neural network are used to test whether the statistical nature of the UHRR radar data
may be better represented by a non-parametric classifier. Class separability is measured
by performing a Bayes Error Bounding test as described by Fukunaga and implemented by
Martin (20, 33).

After  assessing  these  questions,  feature  discrimination  is  explored.  First,  the  idea  that 
some  features  are  more  relevant  to  classification  than  others  is  examined.  Then,  the  analysis 
investigates  the  premise  that  information  relevant  to  a  target  can  be  found  in  distinct  bands 
of  the  frequency  domain.  The  tool  used  to  explore  this  idea  is  the  wavelet,  because  wavelet 
decompositions  systematically  break  out  the  frequency  domain  into  orthogonal  bases.  The 
basic idea originates from Coifman (10:714) and measures the information content of the
bands with an entropy measure. Chang applies the $l_1$ norm as an entropy measure and uses
the  results  for  class  discrimination  with  respect  to  two  dimensional  textures.  This  thesis  finds 
discriminantly  relevant  bands  with  a  modified  version  of  a  discrimination  technique  described 
by  Lee  and  Landgrebe  (29). 

1.5  Objectives 

There  are  five  objectives  of  this  thesis: 

1.  Describe  and  document  the  implementation  of  the  Hughes  AGC  as  provided  for  this 
research.  (Chapter  II,  Section  2.5) 

2.  Examine  the  effect  of  proper  alignment  during  training  and  testing  on  the  AGC’s 
performance.  Examine  the  effect  of  signatures  with  noise  flashes  (corrupt  signatures) 
even  when  properly  aligned  to  class  templates.  (Chapter  III,  Section  3.3) 

3.  Determine  whether  a  non-parametric  classifier  may  be  better  suited  to  the  UHRR  radar 
problem.  (Chapter  III,  Section  3.4) 

4.  Quantify  whether  the  UHRR  radar  data  is  separable  by  estimating  the  Bayes  error. 
(Chapter  III,  Section  3.5) 

5.  Extract discriminantly relevant features with the raw data and with a multiresolution
analysis  to  improve  classification  or  reduce  the  number  of  features  required  for  similar 
performance.  (Chapter  III,  Section  3.7  and  Chapter  V,  Section  5.2) 

1.6  Organization 

Chapter  II  begins  with  a  description  of  the  statistical  and  mathematical  theories  which 
support  the  methodology  and  results  of  this  thesis.  Also  included  is  a  description  of  how  the 
AGC  works.  Chapter  III  describes  the  evaluation  of  the  AGC  with  respect  to  the  alignment, 
registration,  and  corrupted  signature  issues.  Bayes  error  analysis  is  used  to  provide  a  measure 
of  classification  capability  and  class  separability.  Feature  analysis  is  used  to  quantify  where 
in  the  signatures  the  discriminantly  relevant  information  lies.  Chapter  IV  presents  results  for 
the  statistical  and  feature  analyses  described  in  Chapter  III.  Chapter  V  explores  the  use  of 
wavelet  multiresolution  analysis  to  find  those  frequency  bands  where  discriminantly  relevant 
information  lies. 

II.  Theory


2.1  Introduction 

The underlying mathematics behind extracting features and discriminating in an
N-dimensional feature space are well understood within the context of statistical pattern recognition
(19).  A  key  factor  is  to  find  the  balance  between  computation  time  and  classification  accuracy. 
Another  key  element  is  having  enough  sample  data  to  establish  a  reasonable  representation  of 
class  boundaries  for  the  data  being  analyzed. 

Linear algebra is the tool used to frame the problem. Representations of signals
(features) may be imagined in an N-dimensional Euclidean space, $\mathbb{R}^N$, where N is the
number of features. By extracting features from a signal, a feature vector is projected into
$\mathbb{R}^N$ and is denoted as a column vector, $x = [x_1\; x_2\; \ldots\; x_N]^T$. Samples of a class of target
would be expected to share common features (and, thus, identical feature vectors), but noise
causes actual measurements to vary within $\mathbb{R}^N$. For the UHRR radar problem, noise comes
from uncertainties in relative positions of target and radar, atmospheric effects, and equipment
variations. Probability density functions (pdf’s) are estimated and used to find where in
$\mathbb{R}^N$ a given class tends to cluster. In some cases, these pdf’s may also be used to generate
class  boundaries  and  indicate  class  separability.  When  an  unknown  signature  (exemplar)  is 
presented  to  the  classifier,  the  classifier  “guesses”  the  proper  class  assignment  based  on  the 
estimated  parameters  of  the  underlying  statistical  distributions. 

Theoretically, the best achievable success rate is limited by the Bayes error rate (41:35).
The  Bayes  decision  rule  tells  one  where  to  draw  the  decision  boundary  lines  in  the  feature 
space.  Boundaries  between  classes  are  formed  and  used  to  classify  unknown  observations.  In 
the  case  of  parametric  classifiers,  the  boundary  may  be  estimated  analytically.  One  recently 
developed  algorithm  uses  information  derived  from  the  estimated  decision  boundary  itself  to 
determine  a  new  set  of  relevant  features.  These  topics  are  discussed  in  the  first  half  of  this 
chapter.  Also  included  is  a  brief  introduction  to  wavelets,  which  will  be  used  as  a  tool  in 
demonstrating  frequency  analysis  in  a  two  class  case. 

With the Adaptive Gaussian Classifier (AGC) algorithm, the assumption is, naturally,
that the distributions of the data are Gaussian. For the purposes of this thesis, AGC refers to the
entire Hughes algorithm, including preprocessing and classification. UHRR radar signatures
are projected into a 192-dimensional feature space, where the complex return in each
range bin represents a feature. This strategy is validated by Fukunaga, who has shown that the
best achievable classification performance is obtained with the raw data (19). Any processing or filtering
ultimately  removes  some  information  from  the  data.  Several  preprocessing  steps  are  taken 
to  put  the  data  into  usable  form.  The  AGC  takes  the  data,  aligns  it  to  class  template  vectors, 
and  classifies  signatures  with  a  Gaussian  classifier.  The  last  portion  of  this  chapter  details  the 
steps  in  that  process. 

2.2  Bayes  Decision  Rule 

2.2.1  Fundamentals.  The  foundation  of  statistical  pattern  recognition  rests  on 
Bayes  rule  which  is  expressed  mathematically  as  (19,  41) 

$$p(\omega_i|x) = \frac{p(x|\omega_i)\,P(\omega_i)}{p(x)}. \tag{1}$$

Bayes rule combines information about the a priori probability of a class, $P(\omega_i)$, with the total
probability, $p(x)$, of a measurement value, $x$, and the conditional pdf of $x$ given class
$\omega_i$: $p(x|\omega_i)$. The a posteriori probability, $p(\omega_i|x)$, expresses the probability that the unknown sample
belongs to class $\omega_i$, given a measurement $x$.

The pdf for a continuous random variable $x$, when integrated between two limits, gives
the probability that $x$ will take on a value between those limits. The pdf for a discrete random
variable yields a similar value when the pdf is summed across the indices included by the
limits. The right hand side of Equation 1 includes three terms. $P(\omega_i)$ is the a priori probability
that class $\omega_i$ occurred. That is, for an infinite number of trials, it is the expected frequency
with which $\omega_i$ appears. The term $p(x|\omega_i)$ is the conditional pdf yielding the probability that
$x$ is observed, given that the class occurred ($\omega_i$, $i \in \{1, 2, \ldots, K\}$, where $K$ is the number
of classes). $p(x)$ is the overall pdf for a measurement $x$: in other words, a measure of
how likely the value $x$ is to be observed, regardless of class. Note that

$$p(x) = \sum_{i=1}^{K} p(x|\omega_i)\,P(\omega_i), \tag{2}$$

which says that the overall pdf for $x$ equals the sum of each class’s conditional pdf multiplied by
the a priori probability for that class. One term of this summation (the class’s contribution
to the sum) is used in Equation 1. The ratio of that portion of the summation to the overall pdf
of $x$ gives the a posteriori probability of class $\omega_i$.

Bayes decision rule assigns an unknown measurement, $x$, to the class with the highest
a posteriori value. This criterion ensures that the probability of assigning an exemplar to the
wrong class is minimized and is easily extended to the multivariate case where $x$ need not be
a one-dimensional vector, but could be any N-dimensional vector, $x \in \mathbb{R}^N$, containing the
features extracted by some measurement.
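
The decision rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming scalar measurements and Gaussian class-conditional pdfs; the class means, variances, priors, and function names are hypothetical and are not taken from the UHRR radar data.

```python
import math

def gaussian_pdf(x, mean, var):
    """One-dimensional Gaussian density (assumed form for p(x | w_i))."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posteriors(x, means, variances, priors):
    """Return the a posteriori probability of every class given x."""
    # Numerators of Bayes rule: p(x | w_i) * P(w_i)
    joint = [gaussian_pdf(x, m, v) * p for m, v, p in zip(means, variances, priors)]
    # Total probability p(x): the sum of the numerators over all classes
    total = sum(joint)
    return [j / total for j in joint]

def bayes_decide(x, means, variances, priors):
    """Index of the class with the highest a posteriori probability."""
    post = posteriors(x, means, variances, priors)
    return max(range(len(post)), key=post.__getitem__)

# Hypothetical two-class example with equal priors.
MEANS = [0.0, 2.0]
VARIANCES = [1.0, 1.0]
PRIORS = [0.5, 0.5]
print(posteriors(0.5, MEANS, VARIANCES, PRIORS))
print(bayes_decide(0.5, MEANS, VARIANCES, PRIORS))
```

Because the total probability is common to every class, the decision itself depends only on the numerators; the division is needed only when the a posteriori values themselves are of interest.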

For  a  two-class  problem,  the  probability  of  error  is  equal  to  the  chance  that  the  classifier 
guesses  the  wrong  class,  given  a  value  of  x.  Mathematically,  one  integrates  the  x  dependence 
out  of  the  conditional  probability: 


$$P(\text{error}) = \int_{-\infty}^{\infty} P(\text{error}|x)\,p(x)\,dx, \tag{3}$$

to  get  the  expected  probability  of  error  for  a  data  set.  To  account  for  errors  over  both  classes, 
substitute  expressions  representing  the  probabilities  of  making  an  error  for  each  class  into 
Equation  3  (19), 


$$P(\text{error}) = \int_{S_2} p(x|\omega_1)\,P(\omega_1)\,dx + \int_{S_1} p(x|\omega_2)\,P(\omega_2)\,dx, \tag{4}$$


where $S_1$ and $S_2$ represent the respective decision regions for each class. In words, it adds up the
probability of an exemplar occurring in the wrong decision region. Note that the dependence
on $p(x)$ has divided out. As stated in Fukunaga (19), this is the Bayes error rate, the minimum
error rate for any classifier. Equation 4 expresses mathematically the rule of calculating the area under
the tails of the pdf’s for finding the expected probability of error. See Schalkoff (41:35) for
further details.
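
As a rough numerical illustration of Equation 4, the sketch below evaluates the Bayes error for two scalar Gaussian classes by direct integration. The class parameters, integration limits, and function names are assumptions chosen for the example; they do not describe the measured signatures.

```python
import math

def gaussian_pdf(x, mean, var):
    """One-dimensional Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def bayes_error_1d(mean1, var1, prior1, mean2, var2, prior2, lo, hi, steps=20000):
    """Numerically evaluate the two-class error integral for scalar Gaussians.

    Under the Bayes rule each x lies in the decision region of the class with
    the larger p(x | w_i) P(w_i), so the error integrand at x is the smaller
    of the two products.
    """
    dx = (hi - lo) / steps
    total = 0.0
    for n in range(steps):
        x = lo + (n + 0.5) * dx  # midpoint rule
        total += min(gaussian_pdf(x, mean1, var1) * prior1,
                     gaussian_pdf(x, mean2, var2) * prior2) * dx
    return total

# Hypothetical classes: unit variance, means two standard deviations apart.
error = bayes_error_1d(0.0, 1.0, 0.5, 2.0, 1.0, 0.5, lo=-10.0, hi=12.0)
# Closed form for this symmetric case: the Gaussian tail area beyond one sigma.
closed_form = 0.5 * math.erfc(1.0 / math.sqrt(2.0))
print(error, closed_form)
```

In one dimension the integral is easy to evaluate; the same integral over a 192-dimensional feature space is what motivates the Monte Carlo and bounding techniques used later.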

2.2.2  Functional Approximations of pdf’s.  The actual underlying pdf of a random
process (such as a UHRR radar signature) is rarely known. The conditional pdf, $p(x|\omega_i)$, for
a class, $\omega_i$, is estimated or parameterized, based on samples from that class. There are two
approaches for constructing the conditional pdf: parametric and non-parametric.

2.2.2.1  Parametric.  Parametric approximations assume a functional form
for the pdf and calculate the function’s parameters based on a set of sample points, called the
training set. The most significant example is the Gaussian pdf. The Gaussian is important
because many natural processes tend to have Gaussian distributions. The central limit
theorem states that the statistics of a system, no matter what the underlying random variables
contributing to the process, become Gaussian as the number of underlying variables goes to
infinity (19:17). The central limit theorem presumes that the process is a linear combination
of  the  underlying  random  variables.  Note  the  Gaussian  case  assumes  a  unimodal  pdf,  which 
may  or  may  not  be  the  true  “shape”  of  the  actual  pdf.  This  assumption  will  cause  problems  if 
it  does  not  reflect  the  true  distributions  of  the  data. 

The parameters that completely define a one-dimensional Gaussian pdf are the mean
($\mu$) and variance ($\sigma^2$) of the associated random variable. Each is estimated using the training
set and the training procedure forms an estimate of the pdf of the actual random process.
These estimates should be “unbiased” so that the random process which generates the data is
properly characterized. An estimate is unbiased when $\hat{\mu}$ and $\hat{\sigma}^2$ converge in a mean square
sense to the expected values of $\mu$ and $\sigma^2$ when $J \to \infty$, where $J$ is the number of samples
used in training. In other words, the estimates should converge to the actual values as one
uses more and more training samples.

To form these estimates, suppose the training data consists of measurements $x_j$ (scalar) or $\mathbf{x}_j$ (vector),
where $j$ indexes the sample from a set of $J$ samples. For one dimension, unbiased estimates of

For future reference, note that this equation is an averaged sum of outer products. The
covariance matrix, $\Sigma_i$, gives the statistical interrelationship between the feature elements
(37:427-455). Diagonal elements are variance estimates. If all off-diagonal elements are zero,
the features are uncorrelated and, under the Gaussian assumption, statistically independent.
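
The "averaged sum of outer products" can be made concrete with a short sketch. It assumes the standard unbiased estimators (a divisor of J - 1 for the covariance), which may differ in normalization from the exact form used in this thesis; the training vectors and function names are hypothetical.

```python
def sample_mean(samples):
    """Component-wise mean of a list of equal-length feature vectors."""
    count = len(samples)
    dim = len(samples[0])
    return [sum(s[d] for s in samples) / count for d in range(dim)]

def sample_covariance(samples):
    """Covariance estimate: sum of outer products divided by J - 1 (assumed form)."""
    count = len(samples)
    dim = len(samples[0])
    mean = sample_mean(samples)
    cov = [[0.0] * dim for _ in range(dim)]
    for s in samples:
        diff = [s[d] - mean[d] for d in range(dim)]
        # Accumulate the outer product diff * diff^T for this sample.
        for r in range(dim):
            for c in range(dim):
                cov[r][c] += diff[r] * diff[c]
    return [[value / (count - 1) for value in row] for row in cov]

# Hypothetical two-feature training set for one class.
TRAINING = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
print(sample_mean(TRAINING))        # [2.5, 5.0]
print(sample_covariance(TRAINING))
```

In this toy set the second feature is exactly twice the first, so the off-diagonal entries are large and the estimated covariance matrix is singular, a reminder that strongly correlated features can make the matrix inverse used by a Gaussian classifier ill-conditioned.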

The exponent term on the end of Equation 8 is an important one and is known as
the Mahalanobis distance between the measured data $x$ and the estimated mean of class $\omega_i$.
Unlike a Euclidean distance, the Mahalanobis distance normalizes with respect to the variance
along each feature axis in $\mathbb{R}^N$. Thus, distances along features which have widely varying values
are normalized relative to distances along features with more “concentrated” pdf’s.
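
The normalizing effect can be seen in a small two-feature sketch. The mean, covariance, and test points below are hypothetical, and the closed-form 2 x 2 inverse is used only to keep the example self-contained.

```python
def invert_2x2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0.0:
        raise ValueError("covariance matrix is singular")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def mahalanobis_sq(x, mean, cov):
    """Quadratic form (x - mean)^T cov^{-1} (x - mean)."""
    inv = invert_2x2(cov)
    d = [x[0] - mean[0], x[1] - mean[1]]
    return (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
            + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1]))

MEAN = [0.0, 0.0]
COV = [[4.0, 0.0], [0.0, 0.25]]  # feature 0 varies widely, feature 1 is concentrated

# Both points are one Euclidean unit from the mean, but along different axes.
print(mahalanobis_sq([1.0, 0.0], MEAN, COV))  # 0.25
print(mahalanobis_sq([0.0, 1.0], MEAN, COV))  # 4.0
```

A unit step along the widely varying feature counts for little, while the same step along the concentrated feature counts for much more.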

2.2.2.2  Non-parametric.  Non-parametric approximations assume no form
for  the  pdf  function  as  a  whole.  Instead,  the  pdf  is  estimated  directly  from  the  data.  One  type 
of  non-parametric  classifier  is  the  k-NN,  where  the  idea  is  that  the  pdf  will  be  larger  where 
more  samples  tend  to  appear  and  smaller  where  fewer  samples  appear.  Although  specific 
definitions  may  be  argued,  another  type  of  non-parametric  technique  is  the  neural  network. 
Martin  gives  an  excellent  description  of  estimating  pdf’s  with  these  techniques  (32,  33). 

The k-nearest neighbor technique is a common non-parametric density estimate.
As developed by Fukunaga (19:268), a random hyper-volume, $V$, is created around a given
training vector, $x$. This volume expands until the $k$th nearest neighbor matching $x$’s class, $\omega_x$,
is found. Thus, $V$ is a random function of $x$ and is smaller in regions where samples of $\omega_x$
are most dense. This is an inverse relationship which may be expressed as

(11) 

where  N  is  the  total  number  of  samples  and  k  is  the  number  of  nearest  neighbors  used  when 
forming  V.  Any  distance  metric  may  be  used  to  judge  which  neighbors  are  closest  to  x. 
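
The inverse relationship between volume and density can be sketched for scalar data as follows. The estimate used here, k / (N V), is the common textbook form and is an assumption; it is not necessarily the exact expression of Equation 11 (some references use k - 1 in the numerator). The sample set and function name are hypothetical.

```python
import random

def knn_density_1d(x, samples, k):
    """k-NN density estimate at x for scalar samples, using k / (N * V)."""
    distances = sorted(abs(s - x) for s in samples)
    radius = distances[k - 1]          # distance to the k-th nearest sample
    volume = 2.0 * radius              # length of the interval [x - r, x + r]
    return k / (len(samples) * volume)

random.seed(0)
SAMPLES = [random.gauss(0.0, 1.0) for _ in range(5000)]
dense = knn_density_1d(0.0, SAMPLES, k=50)    # near the mode
sparse = knn_density_1d(2.5, SAMPLES, k=50)   # out in the tail
print(dense, sparse)
```

Near the mode the interval needed to capture 50 samples is short, so the estimate is large; in the tail the interval must grow and the estimate falls.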

Artificial  neural  networks  may  also  be  used  to  estimate  the  a  posteriori  probabilities 
of  the  classes  in  a  pattern  recognition  problem.  The  term  “neural  network”  originates  from 
the  behavior  of  neurons  in  the  brain,  but  the  association  is  oblique,  at  best.  A  brief  overview 
of  neural  networks  is  included  in  this  thesis.  This  overview  is  in  the  same  form  as  found  in 
Martin  (32),  and  further  details  may  be  found  in  Lippmann  (30)  and  Rogers  (39). 

A  neural  network  consists  of  a  matrix  of  nodes  which  are  interconnected  by  weighted 
input  and  output  paths.  The  weights  are  adjusted  during  training,  when  they  are  modified  to 
force  the  network  to  yield  a  desired  output  for  a  given  set  of  training  vectors.  A  basic  neural 
network architecture, known as a multi-layer perceptron, is shown in Figure 1. Using the
standard terminology, this particular network has one hidden layer with five hidden nodes.
Only four of the weights (represented by the interconnecting lines) are labelled, but each
connection has a weight associated with it. There are three input nodes and two output nodes.
Also shown are bias nodes, represented as the 1’s enclosed in circles. These nodes are an
integral  part  of  the  system  and  have  weights  associated  with  their  interconnections  like  all  the 
other  nodes. 

[Figure 1.  Example of a Multi-Layer Perceptron. Layers shown: Input Nodes, Hidden Nodes, Output Nodes.]


To use a neural network as a classifier, one may require that $z_1$ be “high” relative to $z_2$
when a member of class $\omega_1$ is presented at the input nodes. One may have 192 input nodes
and four output nodes for the four-class UHRR radar problem. The number of hidden nodes is
fairly flexible, but depends on the complexity of the decision regions required to separate the
data. Usually, one must vary the number of hidden nodes as a parameter and simply see how
results are affected. This problem is similar to the problem of choosing the proper number
of nearest neighbors to use in the k-NN algorithm. See Martin (33) and Fukunaga (20) for
discussions of this issue.

During learning, training vectors are presented to the neural network inputs. Each input
is weighted by $w^k_{ij}$ and then presented as an addend to the hidden nodes, where $k$ is the layer, $i$
is the input to that node, and $j$ is the node at the $k$th layer. The weights are initially
randomized. Each hidden node implements a non-linear function known as a sigmoid. Other
functions may be used, but sigmoids have attractive characteristics which are mentioned in
the literature (39). Mathematically, for each node,

$$f(a) = \frac{1}{1 + e^{-a}}, \tag{12}$$

where $a$ is a weighted sum of outputs from the previous layer. Weights are usually updated by
a  process  known  as  “backpropagation.”  This  gradient  descent  technique  seeks  to  minimize  an 
error  measure  for  a  given  training  exemplar  and  set  of  weights.  A  common  error  measure  is  the 
square  error  between  desired  and  actual  outputs  for  a  given  output  node.  As  updated  values  of 
the  weights  are  generated,  the  neural  network  is  conditioned  to  provide  the  appropriate  outputs 
for  a  given  class  at  the  input  nodes. 
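
A forward pass through a network shaped like the one in Figure 1 (three inputs, five hidden nodes, two outputs, plus bias nodes) can be sketched as follows. The logistic sigmoid, its use at the output nodes, the weight range, and all names are assumptions made for illustration; training by backpropagation is not shown.

```python
import math
import random

def sigmoid(a):
    """Logistic sigmoid applied to the weighted sum a."""
    return 1.0 / (1.0 + math.exp(-a))

def layer_forward(inputs, weights):
    """Propagate inputs through one layer.

    weights[j] holds the weights into node j; its last entry multiplies the
    bias node, whose output is fixed at 1.
    """
    extended = list(inputs) + [1.0]
    return [sigmoid(sum(w * v for w, v in zip(node_w, extended)))
            for node_w in weights]

def random_weights(n_in, n_out, rng):
    """Small random initial weights, including one bias weight per node."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]

rng = random.Random(1)
HIDDEN_W = random_weights(3, 5, rng)   # three inputs -> five hidden nodes
OUTPUT_W = random_weights(5, 2, rng)   # five hidden nodes -> two outputs

hidden = layer_forward([0.2, -0.4, 0.9], HIDDEN_W)
outputs = layer_forward(hidden, OUTPUT_W)
print(outputs)
```

With untrained random weights the two outputs carry no class information; training adjusts the weights until the output for the correct class is high relative to the other.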

As shown by Ruck (40), a neural network trained in this way approximates the
underlying a posteriori probability function. Rogers shows how the internal representations of the
weights  may  be  combined  and  interpreted  as  decision  boundaries  for  classification  (39:58). 
The  bias  nodes  allow  the  decision  boundaries  to  not  necessarily  pass  through  the  origin.  Steppe 
provides  a  rigorous  method  for  determining  salient  features  from  classification  by  a  neural 
network  (42,  43). 

2.2.3  Likelihood Ratios.  Fukunaga discusses the use of pdf estimates to make
optimal decisions in a given pattern recognition problem (19:51). The following development
parallels his. Decisions in statistical pattern recognition depend upon the pdf estimate used,
the training data available, and the distance metric used to measure the similarities between
test samples and templates. These dependencies require the researcher to be careful in setting
up experiments. The three dependencies mentioned here parallel the questions to be explored
in Chapter III of this thesis.

In a two class problem, one assigns a test vector to the class that is most likely given
the measurement. Using Equation 1,

$$\frac{p(x|\omega_1)\,P(\omega_1)}{p(x)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{p(x|\omega_2)\,P(\omega_2)}{p(x)}. \tag{13}$$

The  above  equation  says  “decide  class  1  if  the  left  hand  side  is  greater  than  the  right  hand  side 
of  the  equation  and  decide  class  2  if  the  left  hand  side  is  less  than  the  right  hand  side  of  the 
equation.”  Because  it  is  a  pdf,  p(x)  is  by  definition  always  positive  and  is  common  to  both 
sides  of  the  equation,  so  it  divides  out,  leaving 


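$$p(x|\omega_1)\,P(\omega_1) \underset{\omega_2}{\overset{\omega_1}{\gtrless}} p(x|\omega_2)\,P(\omega_2) \tag{14}$$

$$\frac{p(x|\omega_1)}{p(x|\omega_2)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{P(\omega_2)}{P(\omega_1)} \tag{15}$$

with

$$\ell(x) = \frac{p(x|\omega_1)}{p(x|\omega_2)}. \tag{16}$$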


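$\ell(x)$ is known as the “likelihood ratio”. Assuming Gaussian distributions, versions of
Equation 8 (corresponding to the two classes under test) are inserted into Equation 15 and a
natural logarithm is taken to remove the exponential. The $(2\pi)^{N/2}$ normalization factors cancel, yielding

$$\begin{align}
\ln \ell(x) &= \ln p(x|\omega_1) - \ln p(x|\omega_2) \tag{17}\\
&= -\tfrac{1}{2}(x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1) - \tfrac{1}{2}\ln|\Sigma_1| + \tfrac{1}{2}(x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2) + \tfrac{1}{2}\ln|\Sigma_2| \tag{18}\\
&= -\tfrac{1}{2}(x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1) + \tfrac{1}{2}(x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2) - \tfrac{1}{2}\ln\frac{|\Sigma_1|}{|\Sigma_2|} \tag{19}\\
\ln \ell(x) &\underset{\omega_2}{\overset{\omega_1}{\gtrless}} \ln\frac{P(\omega_2)}{P(\omega_1)}. \tag{20}
\end{align}$$

Taken as an equality, this decision rule represents a quadratic surface in $\mathbb{R}^N$ which separates the two classes. Using this
equation as a classifier presumes unimodal Gaussian distributions, but is analytically tractable.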

The decision rule or discriminant function, $h(x)$, is defined to be (29):

$$h(x) = -\ln \ell(x). \tag{21}$$


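The minus sign introduced here cancels with the minus signs in the left hand portion of
Equation 18, and the inequalities change direction. The sign change on the right hand side is
absorbed by exchanging numerator and denominator. These steps proceed as follows:

$$\begin{align}
h(x) &\underset{\omega_2}{\overset{\omega_1}{\lessgtr}} -\ln\frac{P(\omega_2)}{P(\omega_1)} \tag{22}\\
h(x) &\underset{\omega_2}{\overset{\omega_1}{\lessgtr}} \ln\frac{P(\omega_1)}{P(\omega_2)} \tag{23}\\
h(x) &\underset{\omega_2}{\overset{\omega_1}{\lessgtr}} t. \tag{24}
\end{align}$$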


The parameter $t$ is known as the decision threshold. If the a priori probabilities are assumed
equal, $t$ becomes zero. The discriminant function $h(x)$ has intuitive appeal because it implies
that $x$ “probably belongs” to the nearest class template under the Mahalanobis distance metric,
plus an additional term. This additional term contains the ratio of the determinants of the
covariances and biases the decision in favor of the class with the tighter distribution. Thus,
the determinant term will be referred to as the “bias” for the remainder of the thesis. Strang
shows how the determinant can be interpreted as a volume in $N$-dimensional space (45). If
a class is spread out or tends to “occupy” a significant volume in $R^N$, that class is penalized.
For a K-class problem, one takes the minimum of the K interclass distances (scaled by the
bias) in making a decision. Note that in Fukunaga (19), the decision rule has a typographic
error throughout the book because the greater than/less than symbol is reversed from what it
should read (for example, see Page 125, Equation (4.1)).
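
As an illustration of the discriminant just described, the following minimal Python sketch
evaluates $h(x)$ for two Gaussian classes and compares it with the threshold $t$. It assumes
NumPy is available; the function names, the toy means and covariances, and the test points are
hypothetical and are not taken from the thesis code.

```python
import numpy as np

def discriminant(x, mu1, cov1, mu2, cov2):
    """h(x) = -ln l(x) for two Gaussian classes (quadratic form plus bias term)."""
    d1 = x - mu1
    d2 = x - mu2
    maha1 = d1 @ np.linalg.solve(cov1, d1)  # squared Mahalanobis distance to class 1
    maha2 = d2 @ np.linalg.solve(cov2, d2)  # squared Mahalanobis distance to class 2
    # slogdet returns (sign, log|det|); the log-determinant ratio is the "bias"
    bias = np.linalg.slogdet(cov1)[1] - np.linalg.slogdet(cov2)[1]
    return 0.5 * maha1 - 0.5 * maha2 + 0.5 * bias

def classify(x, mu1, cov1, mu2, cov2, p1=0.5, p2=0.5):
    """Return 1 if h(x) < t, else 2, with t = ln(P(w1) / P(w2))."""
    t = np.log(p1 / p2)
    return 1 if discriminant(x, mu1, cov1, mu2, cov2) < t else 2

# Hypothetical two-class problem: class 1 is tight, class 2 is spread out.
mu1, cov1 = np.array([0.0, 0.0]), np.eye(2)
mu2, cov2 = np.array([3.0, 0.0]), 4.0 * np.eye(2)

label_near_1 = classify(np.array([0.2, -0.1]), mu1, cov1, mu2, cov2)
label_near_2 = classify(np.array([3.5, 0.5]), mu1, cov1, mu2, cov2)
# This point has the same Mahalanobis distance to both classes,
# so only the bias term decides, in favor of the tighter class 1.
label_tie = classify(np.array([1.0, 0.0]), mu1, cov1, mu2, cov2)
```

In this toy setup the tie point is assigned to class 1, which is the behavior the text attributes
to the bias term.
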

$h(x)$ may also be computed in the case of k-NN classifiers. The following equations are taken
from Martin’s thesis (32:14-15) and are implemented later in his code. Use Equation 11 and
insert the equation for V:

V  =  (25) 

In  this  equation,  is  the  square  root  of  the  Mahalanobis  distance  between  the  exemplar  and 
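the class mean being tested against. $V_d$ is the volume coefficient of a hyperellipsoid of $n$
dimensions (the volume of the unit hypersphere, with $n = N$ here), which is given by

$$V_d = \frac{\pi^{n/2}}{(n/2)!}, \quad n \text{ even}, \tag{26}$$

$$V_d = \frac{2^n\, \pi^{(n-1)/2} \left(\frac{n-1}{2}\right)!}{n!}, \quad n \text{ odd}. \tag{27}$$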

These  equations  yield 

and  may  be  inserted  into  Equation  15. 
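
The even and odd closed forms for $V_d$ can be checked numerically. The sketch below uses only
the Python standard library; the function name is hypothetical, and the comparison against the
Gamma-function form $\pi^{n/2} / \Gamma(n/2 + 1)$ is a standard identity brought in for
verification, not something stated in the thesis.

```python
import math

def unit_ball_volume(n):
    """Volume of the unit hypersphere in n dimensions, using the even/odd closed forms."""
    if n % 2 == 0:
        return math.pi ** (n // 2) / math.factorial(n // 2)
    half = (n - 1) // 2
    return 2 ** n * math.pi ** half * math.factorial(half) / math.factorial(n)

# Familiar low-dimensional cases: length of [-1, 1], area of the unit disk,
# and volume of the unit sphere.
v1 = unit_ball_volume(1)
v2 = unit_ball_volume(2)
v3 = unit_ball_volume(3)

# Cross-check against the Gamma-function form for the first ten dimensions.
max_gap = max(
    abs(unit_ball_volume(n) - math.pi ** (n / 2) / math.gamma(n / 2 + 1))
    for n in range(1, 11)
)
```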

Bayes decision rule, developed above, is the core of statistical pattern recognition. As
stated by Fukunaga, “the probability of error is the most effective measure of a decision rule’s
usefulness (19:85),” and Equation 4 is the equation giving that value. Unfortunately, that
integral can be analytically intractable in high-dimensional problems even with the Gaussian
assumption. To overcome this, Fukunaga suggests using a Monte Carlo analysis or bounding
the error probabilities (19:85). In this thesis, both techniques are utilized.

In  any  case,  one  is  estimating  the  predicted  error  rate  for  the  classifier.  In  effect,  the 
estimate  itself  is  a  random  variable  and  has  some  variance  associated  with  it.  This  uncertainty 
may  be  quantified  with  the  use  of  confidence  intervals.  When  applying  a  confidence  interval  to 
an  error  rate,  one  is  estimating  the  variance  of  a  proportion,  where  the  proportion  is  a  measure 
of the expectation of a correct classification. The proportion behaves as a Bernoulli random
variable  with  a  binary  output:  correctly  classified  or  not  correctly  classified  (13:347). 

Any  statistics  book  will  give  the  basics  of  this  procedure,  but  Papoulis  (37:241-256,270) 
gives  the  complete  story,  if  one  is  careful  to  follow  his  cross-references.  The  technique  treats 
the  classification  rate  produced  from  one  Monte  Carlo  trial  as  an  estimate  of  the  mean  of  a 
Bernoulli  process.  As  more  estimates  of  the  mean  are  taken,  one  becomes  more  confident  that 
the  actual  mean  is  “close  to”  this  value. 

To produce a confidence interval, one requires a test statistic, $z_{\alpha/2}$. The degree of
confidence is “$1 - \alpha$”, expressed as a percentage. As shown in Figure 2, the goal is to find
the extent of the shaded region of the curve, which represents the density function of the
Bernoulli, or binomial, random variable. In this case, the region shown may represent where
95% of the mean estimates are expected to fall, after normalizing the mean to zero. This
implies $\alpha/2$ equals 0.025. The value of $z_{\alpha/2}$ may be obtained from standard tables
at a cumulative probability of $1 - \alpha/2 = 0.975$, and for a 95% confidence level,
$z_{\alpha/2} = 1.96$ (2:624).

Figure 2. Confidence Interval Schematic

The variance is estimated with


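$$\hat{\sigma}^2 \approx \frac{\hat{p}(1-\hat{p})}{n} \tag{29}$$

where $\hat{p}$ equals the average classification rate for $n$ trials. The following equations are
used to find the lower and upper bounds on the estimate of the true rate $p$:

$$P\left\{ -z_{\alpha/2} < \frac{\hat{p} - p}{\sqrt{\hat{p}(1-\hat{p})/n}} < z_{\alpha/2} \right\} = 1 - \alpha$$

$$P\left\{ \hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \right\} = 1 - \alpha \tag{30}$$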
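
The interval of Equation 30 is simple to compute. The following sketch uses only the Python
standard library; the function name and the example rate and trial count are hypothetical. It
takes $z_{\alpha/2}$ from the standard normal quantile at $1 - \alpha/2$, which is about 1.96
for a confidence of 0.95.

```python
import math
from statistics import NormalDist

def classification_rate_interval(p_hat, n, confidence=0.95):
    """Normal-approximation interval for a classification rate estimated from n trials."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)        # z_{alpha/2}
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Hypothetical example: a 90% average classification rate over 400 trials.
lower, upper = classification_rate_interval(0.90, 400)
z_95 = NormalDist().inv_cdf(0.975)
```

For these example numbers the half-width is roughly 0.03, so the interval runs from about 0.87
to about 0.93.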


2.3  Bayes’  Bounding 

According to Fukunaga, the estimation of a bound on the Bayes error rate is accomplished
by using a Leave-One-Out (L) and Resubstitution (R) analysis (described in the text
below). The overriding concern in dividing up a finite set of data is maintaining independence
between training and test sets. L, where just one data point is withheld from training and
used for testing, is a special case of the more general holdout method, where some fraction of
the data is used for training and the rest for testing. As long as testing and training sets are
independent, “the [holdout] and L methods are supposed to give very similar, if not identical,
estimates of the classification error (19:221).”

Devijver and Kittler give an excellent discussion of these issues (13:343). They emphasize
the fact that if one reserves most of a limited data set for training, leaving few samples for
testing, one can have little confidence in the test. At the same time, if one reserves most of a
limited data set for testing, the classifier is poorly designed with respect to the statistics of the
data. Also, Devijver and Kittler note that the holdout method tends to overestimate the actual
error rate and give cross-validation as a reasonable compromise in allocating a limited data
set to training and testing subsets (13:355).

The disadvantage of the holdout method is that a tradeoff exists between the bias and
variance of the estimated error rate. A full discussion of these effects is found in Fukunaga and
Lachenbruch (27:1-11). Geman also discusses the impact of bias and variance on results (21).
Bias reflects the accuracy of the classification results for a given experiment with respect to
the accuracy that would be attained in a “real world problem.” Vapnik gives a measure of the
maximum deviation of estimated error rates from the true error applied to the real world (47).
Jain paraphrases “Uncle Bernie’s Rule,” referring to Bernard Widrow’s rule of thumb that
one needs about five to ten times as many samples per class as classifier free parameters in a
pattern recognition problem (23:204). Finite data sets in high-dimensional situations tend to
be biased if the nature of the statistics is not captured by the data available. Over many runs,
the estimates are unbiased if they converge to the expected value. Variance relates to
the consistency of the estimated classification rate over different sets of data.

A Bayes bounding analysis is implemented by Martin using k-Nearest Neighbors (k-NN),
Parzen window, and neural network pdf estimates (32). In essence, the idea is to find an
overly optimistic estimate of the Bayes error by training and testing on identical data sets (the
R case), which serves as a lower bound. Likewise, the L method gives a slightly pessimistic
estimate of the Bayes error; with respect to the Bayes error, the L value represents an upper
bound. The premise behind the R method is that a classifier’s performance can never be better
than when the classifier sees the answers before answering the question. In the L method, if the
data set contains M samples, M trials are made, sequentially withholding one test vector. In
each trial, the classifier trains on $M - 1$ samples and tests on one sample. The total number of
misclassifications across the M trials is used to estimate the error rate.
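
The R and L procedures described above can be sketched as follows. This is a minimal
illustration in standard-library Python, not Martin's implementation: the classifier is a plain
k-NN majority vote, the two-class toy data are hypothetical, and the threshold adjustment
discussed in the next paragraph is not modeled.

```python
import random

def knn_predict(train, query, k):
    """Majority vote among the k training samples nearest to query."""
    def sq_dist(sample):
        return sum((a - b) ** 2 for a, b in zip(sample[0], query))
    votes = [label for _, label in sorted(train, key=sq_dist)[:k]]
    return max(set(votes), key=votes.count)

def resubstitution_error(data, k):
    """R method: train and test on the identical data set."""
    return sum(knn_predict(data, x, k) != y for x, y in data) / len(data)

def leave_one_out_error(data, k):
    """L method: M trials, each withholding one sample from training."""
    errors = 0
    for i, (x, y) in enumerate(data):
        errors += knn_predict(data[:i] + data[i + 1:], x, k) != y
    return errors / len(data)

rng = random.Random(0)  # fixed seed so the toy data are repeatable
data = [((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)), 1) for _ in range(40)]
data += [((rng.gauss(2.0, 1.0), rng.gauss(0.0, 1.0)), 2) for _ in range(40)]

r_error = resubstitution_error(data, k=3)  # optimistic estimate
l_error = leave_one_out_error(data, k=3)   # pessimistic estimate
```

With this voting rule, an odd k, and no duplicated points, every sample misclassified under R is
also misclassified under L, so the R estimate never exceeds the L estimate.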

Martin’s  work  varied  the  parameters  associated  with  the  pdf  estimators  (window  size 
for  Parzen,  number  of  nearest  neighbors  for  k-NN,  and  number  of  hidden  nodes  for  neural 
networks)  and  adjusted  the  decision  threshold  accordingly.  It  had  been  found  that  the  decision 
threshold  plays  a  crucial  role  in  the  effectiveness  of  the  Bayes  error  estimates  (20).  As 
summarized  by  Martin,  when  the  L  method  is  employed,  the  threshold  should  be  adjusted 
to  remove  any  influence  the  sample  under  test  may  have  had  in  estimating  the  pdf.  This 
method for selecting the threshold is called “Option 2” by Martin and is utilized for the results
generated  in  this  thesis  (33). 

2.4  Feature  Space  Utilization 

Efficient use of the feature space is an important element in engineering a pattern
recognition problem. The desire to limit the dimensionality of the problem is driven by several
concerns. Not only does the bias of estimates become significant in high dimensions
(19:316-317), but high-dimensional situations also require more sample points to adequately
represent the statistics of the problem. The latter circumstance was explored by Foley (18), who
studied the impact of using the same data set for both training and testing. Foley presents the
ratio of samples per class to the number of features utilized as the key parameter. For, say, a
quadratic classifier in a 192-dimensional problem, using a common training set and test set is
not acceptable.

Another  important  issue  is  the  “curse  of  dimensionality.”  It  is  possible  to  include  too 
many  features  in  a  problem  because  the  information  added  may  be  redundant  information  or 
noise  (16:67).  Jain  points  out  that  each  added  feature  must  increase  the  separation  between 
classes  by  %  (in  terms  of  Mahalanobis  distance)  to  avoid  a  roll-off  in  classification 

accuracy.  In  this  equation,  n  is  the  number  of  samples  and  d  is  the  dimensionality  of  the 
problem.  Duda  (16:77),  citing  Chandrasekaran  (5),  states  that  the  inclusion  of  more  and  more 
dimensions  will  not  affect  performance  if  the  features  are  truly  statistically  independent. 

Foley  cites  other  authors  whose  results  emphasize  the  negative  effects  of  limited  sample 
size,  including  Cover  (11),  Estes  (17),  and  Kanal  (24).  The  underlying  thrust  of  these  articles 
is  that  the  more  information  one  has  on  the  underlying  behavior  of  the  statistics  of  a  random 
process,  the  better  the  performance  will  be.  At  the  same  time,  one  would  like  to  find  a  minimal 
representation  of  the  statistics  of  the  data.  When  data  is  limited,  which  it  usually  is,  one  must 
develop  techniques  (such  as  the  leave-one-out  method)  designed  to  estimate  the  performance 
of  the  classifier. 

2.4.1 Projections and Transformations. When a researcher takes measurements, a
continuous natural phenomenon is projected into a discrete representation in a feature
space, $R^N$, where each feature represents a value along an axis in that space. This process
can be visualized in the context of linear algebra, with the allowance that visualizing in 192
dimensions can be difficult. Together, the N axes form a basis set for $R^N$, although the
measured features may or may not be linearly independent.

As an example, a continuous EM signal, such as a radar signature, may be projected
into a feature space whose axes correspond to complex returns in range bins. Several items
related to the physics of this problem bound it in a mathematical sense. Radar signatures are
discrete, finite-duration representations of physical processes and are thus finite energy. $R^N$
is assumed to be a Hilbert space, which has the attractive property of being a complete
inner product space. This property is important for several reasons. First, it means that $R^N$
is a linear space and one may find linear manifolds (closed subspaces of $R^N$) in it. Also, it
means that projections, via inner products, into those linear manifolds may be accomplished
and that consistent distance metrics can be defined. The fact that $R^N$ is a linear space implies
rotations do not change the relative positions of points in that space.

Once a signal is projected and thereby vectorized into $R^N$, one can perform further linear
projections and/or rotations to find efficiencies inherent to the data and classifier (19). An
important strategy is to use the transformations to modify the coordinates of the feature space,
exploiting those features which yield the most information about class separation (38:182). As
long as the transformations are linear transformations, the underlying statistics are preserved.

Parsons focuses on two major methods for improving classification through transformations
on $R^N$: decorrelating features (causing the correlation matrix to become non-zero only
on the main diagonal) and maximizing separability. One method to decorrelate features is
via the Karhunen-Loève transform (KLT). The KLT essentially removes dependencies among
linearly dependent features and forms a minimal representation of the data. Thus, $R^N$ is
transformed such that $R^N \rightarrow R^M$, and the transformed space possesses an orthogonal
basis set. The steps for accomplishing the KLT are given in Parsons; mathematical details may be
found in Fukunaga (19:405-409). This linear transformation forms a new set of features which may
be viewed as a minimal representation of the data in a mean square error sense. As pointed out
later in this thesis, however, the KLT is not always the most successful with respect to
classification error.

The essence of the KLT technique is to use the eigenvectors of the covariance matrix
as an orthogonal basis of the new space. The technique yields an orthonormal basis in $R^M$
because the eigenvectors of a symmetric matrix are orthogonal and are normalized to unit energy.
The corresponding magnitudes of the eigenvalues indicate the relative “importance” of each
dimension. The beauty of the KLT is that the error in reconstruction produced by eliminating an
eigenvector is directly related to the magnitude of the associated eigenvalue (19:410). Also, the
KLT does not affect the underlying statistics of the problem because it is a linear
transformation. Fukunaga extends this point by stating that the “class separability, for example
the probability of error due to the Bayes classifier, is invariant under any nonsingular
transformations (19:417).” This fact leads to the ability to use other rotations, such as the one
found in Lee’s article, and projections, such as a multiresolution analysis.
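
The KLT steps described above can be sketched with an eigendecomposition of the sample
covariance matrix. The example assumes NumPy; the function name and the synthetic
three-feature data set (built from two latent variables, so one feature is linearly dependent)
are hypothetical and serve only to show the decorrelation and reconstruction-error properties.

```python
import numpy as np

def klt(data, m):
    """Project the rows of data onto the m dominant eigenvectors of the sample covariance."""
    mean = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder so the largest eigenvalue comes first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    basis = eigvecs[:, :m]                   # orthonormal columns
    return (data - mean) @ basis, eigvals, basis

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
mixing = np.array([[2.0, 0.0, 1.0],
                   [0.0, 0.5, 0.5]])
data = latent @ mixing                       # 3 features, only 2 independent directions

features, eigvals, basis = klt(data, 2)

# Dropping the second eigenvector as well: the mean square reconstruction error
# equals the sum of the discarded eigenvalues.
features1, _, basis1 = klt(data, 1)
rebuilt = features1 @ basis1.T + data.mean(axis=0)
mse = np.sum((data - rebuilt) ** 2) / (len(data) - 1)
```

Here the covariance of `features` is diagonal, the third eigenvalue is numerically zero, and
`mse` matches `eigvals[1] + eigvals[2]`.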

This  is  all  well  and  good,  but  to  improve  classification  accuracy  for  a  classifier,  the 
real  goal  is  to  maximize  the  separability  of  the  data.  The  KLT  decorrelates  features,  but  does 
not  necessarily  indicate  which  features  may  yield  more  information  about  class  separation 
than  others.  The  KLT  does  indicate  in  what  directions  the  data  as  a  whole  tends  to  be 
spread.  As  pointed  out  by  Parsons  (38:185),  the  KLT  finds  a  set  of  axes  along  directions  of 
maximum  variance  which  implies  the  directions  of  maximum  separability.  This  strategy  does 
not  necessarily  work  in  all  cases  (29,  38)  but  the  door  swings  open  on  discriminant  analysis. 

Other feature discrimination techniques include the Fisher discriminant and the Fukunaga
and Koontz method. Like the KLT, both of these methods seek out the dimensions in
$R^N$ that have the most variance. These techniques and several others are cited and briefly
summarized in Lee (29:389). Lee points out that high-dimensional data restricts the ability of
these algorithms to perform in a computationally cost-effective manner. Also, if class means
are close to one another, the results may not be meaningful.

2.4.2 Discrimination Using Decision Boundaries. As mentioned earlier in this
chapter, a decision boundary, defined by $h(x) = t$, separates two classes in the
$N$-dimensional feature space. For a quadratic Gaussian classifier, $h(x)$ is given analytically by
combining Equations 18 and 24:

$$h(x) = \tfrac{1}{2}(x-\mu_1)^T \Sigma_1^{-1} (x-\mu_1) - \tfrac{1}{2}(x-\mu_2)^T \Sigma_2^{-1} (x-\mu_2) + \tfrac{1}{2}\ln\frac{|\Sigma_1|}{|\Sigma_2|}. \tag{31}$$

The decision is class $\omega_1$ if $h(x) < t$, where $t = 0$ for equiprobable a priori
probabilities. This equation represents a surface in the $N$-dimensional feature space and is the
foundation for all decisions in that pattern recognition problem.

A recent pair of articles by Lee and his co-author Landgrebe (28, 29) exploits this
idea by transforming $R^N$ based on the orientation of the most relevant part of the decision
boundary. Their algorithm may be used to highlight more relevant features (“discriminantly
relevant features”) and leave out redundant features (“discriminantly redundant features”).
The authors characterize this transformation as an improvement over the KLT with respect to
classification. While the KLT minimizes mean square reconstruction error based on statistics
of the feature vectors, it is not necessarily optimum in the sense of class separability. Lee’s
new approach is “based on the decision boundaries directly...and [predicting] the minimum
number of features needed to achieve the same classification accuracy as in the original space
for a given problem and [finding] the needed feature vectors (29:389).” That technique centers
on extracting a basis set from the “effective decision boundary feature matrix” (EDBFM). The
algorithm and its implementation are discussed next.

Lee uses the same notation as in this thesis but adds a number of definitions to support
the theorems and development of his argument. His key argument is to find a linear manifold
of $R^N$ called $W_M$. $R^N$ is spanned by basis vectors $\beta_i$, $i = 1, \ldots, N$; $W_M$ is
spanned by a set of $M$ basis vectors indexed by $j = 1, 2, \ldots, M$, with $M < N$. In $W_M$,
the following equation holds for all $x$:

$$(h(x) - t)(h(\hat{x}) - t) > 0, \tag{32}$$

where $\hat{x}$ ($M$-dimensional) is the representation in $W_M$ of $x$ ($N$-dimensional). Lee
states the physical meaning of this equation is that none of the decisions will change in the
subspace (29:390). If the vector is correctly classified in both spaces then the product is always
positive. If the vector is correctly classified in $R^N$ and incorrectly classified in $W_M$, then
the product will be negative. The problem is to find the basis vectors in the original space that
are irrelevant to all decisions for all data points. In other words, one is seeking dimensions
parallel to the decision boundary at all points. The following theorem lays the foundation for
Lee’s approach:

Theorem  1  If  a  vector  is  parallel  to  the  tangent  hyperplane  to  the  decision  boundary  at  every 
point  on  the  decision  boundary  for  a  pattern  classification  problem,  then  the  vector  contains 
no  information  useful  in  discriminating  classes  for  the  pattern  classification  problem,  i.e.,  the 
vector is discriminantly redundant (29:391).

He ends up concluding “the effectiveness of the basis vector is roughly proportional to the
area of the decision boundary that has the same normal vector (29:392).” This leads directly
to the mathematical equation:

$$\Sigma_{DBFM} = \frac{1}{K}\int_S N(x)\,N^T(x)\,p(x)\,dx. \tag{33}$$

In this equation, $N(x)$ represents the normal vector at point $x$ to the decision boundary; $S$
represents the decision boundary surface; and $K = \int_S p(x)\,dx$.

The advantage of the Lee technique becomes clearer when a new surface is formed:
$S'$, defined as the “effective decision boundary.” $S'$ represents that portion of the decision
boundary which separates most of the exemplars in the problem. Often, a very complicated
decision surface can be simplified to a linear equation. Basically, one is removing outliers
from the problem, betting that future outliers will occur infrequently. This truncated surface
is used in Equation 33 to create the EDBFM.

The EDBFM holds information about the orientation of the decision boundary in $R^N$.
The connection is that normal vectors to the decision boundary indicate the discriminantly
relevant directions in $R^N$. The outer product of a normal vector with itself yields contributions
to the EDBFM across the decision surface, $S$. An outer product, which forms an $N$ by $N$
matrix, may be viewed as an operator (36). For the EDBFM analysis, one takes
the outer product of a column vector with itself:

$$X X^T. \tag{34}$$

If this operates on some other vector, one writes:

$$(X X^T)\,x, \tag{35}$$

but this can be reorganized as

$$X\,(X^T x). \tag{36}$$

The term in the parentheses is a familiar inner product and, for a unit-length $X$, is the
projection of $x$ on $X$. This results in a scalar, leaving

$$c\,X, \tag{37}$$

where $c$ is the scalar. Thus, an outer product provides a way to project one vector ($x$) onto a
second vector ($X$) and then orient the result along the second vector in $R^N$. Lee’s algorithm
says the EDBFM is an average of outer products from vectors produced along the decision boundary.
Because they are normal to the decision boundary, these vectors form a correspondence with
discriminantly important directions in the feature space.
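
The reorganization of the outer product into a scalar times a vector can be verified
numerically. The short sketch below assumes NumPy; the vectors are hypothetical values, with
the first chosen to have unit length as a normal vector would.

```python
import numpy as np

X = np.array([0.6, 0.8, 0.0])  # unit-length vector standing in for a boundary normal
x = np.array([2.0, 1.0, 5.0])  # arbitrary vector to be projected

operator = np.outer(X, X)      # N-by-N matrix X X^T
via_operator = operator @ x    # (X X^T) x
c = X @ x                      # scalar inner product X^T x
via_scalar = c * X             # c X, oriented along X
```

Both routes give the same vector, here `[1.2, 1.6, 0.0]`, with the scalar equal to 2.0.
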

The eigenvectors of the EDBFM give an orthonormal basis set for the space in which
classes are best separated, with respect to the decision boundary itself. The dominant
eigenvectors of the EDBFM are associated with the directions in which one is most likely to cross
the decision boundary. The dominant eigenvalues may be used to indicate the most relevant
eigenvectors, as in the KLT method described earlier.

Note  that  the  process  of  taking  outer  products  to  form  the  EDBFM  parallels  the  KLT,  in 
which  outer  products  are  used  to  form  the  covariance  matrix.  Both  matrices  hold  information 
about  the  direction  of  maximum  variance  of  their  constituent  vectors.  The  eigenvectors  of  each 
provide  the  tool  to  transform  the  space  into  a  reduced  feature  space.  Thus,  a  transformation 
using  the  matrix  of  eigenvectors  of  the  EDBFM  orients  the  space  with  respect  to  the  decision 
boundary. 

See the appendices for sample problems and computer code. The steps used in implementing
this procedure are listed here and in the Lee article (29:394).

1. Classify the training samples using estimated mean and covariance matrices and retain
only those samples from each class that are correctly classified. This step ensures that
the equations used below are solvable. Also, apply a chi-square threshold test to each
class to eliminate outliers within the class. Use the Mahalanobis distance metric. Note
that the Mahalanobis distance does not include the bias term. For $\omega_1$, apply steps 2
through 6, below.

2. Apply the chi-square threshold test of $\omega_1$ to $\omega_2$, again using a Mahalanobis
distance. In other words, use only those samples in $\omega_2$ which are relatively close to the
mean of $\omega_1$. Retain a predefined, minimum number of the closest samples from $\omega_2$ if
too few pass the threshold test. One may use a different threshold than in step 1. Steps 1 and 2
limit the calculation to the effective DBFM.

3. For every sample of $\omega_1$, $X_1$, find its nearest neighbor in $\omega_2$, $X_2$.
Essentially, at this point, one is drawing a line between the points. Because only correctly
classified samples have been retained, this line intersects the decision boundary, $h(x) = t$.

4. Because all retained pairs of samples straddle the decision boundary, find the point $P_i$
where the line connecting $X_1$ and $X_2$ intersects the decision boundary. This step
forms an estimate of the complete decision boundary.

5. Find the unit normal vector, $N_i$, to the decision boundary at the point $P_i$. Here, the
subscript $i$ represents the iteration through this list.

6. Repeat steps 3 through 5 for all $K_1$ samples of $\omega_1$ and form an estimate of the
effective decision boundary feature matrix, $\Sigma^1_{EDBFM}$, where

$$\Sigma^1_{EDBFM} = \frac{1}{K_1}\sum_{i=1}^{K_1} N_i N_i^T. \tag{38}$$

7. Repeat steps 2 through 6 for $\omega_2$.

8. Calculate the final EDBFM with

$$\Sigma_{EDBFM} = \Sigma^1_{EDBFM} + \Sigma^2_{EDBFM}. \tag{39}$$

The process is generalized to the multi-class case (with K classes) with

$$\Sigma_{EDBFM} = \sum_{k}^{K} \sum_{j,\, j \neq k}^{K} P(\omega_k)\,P(\omega_j)\,\Sigma^{kj}_{EDBFM}. \tag{40}$$

Lee stresses that the theorems he develops hold for multiclass problems, with the above
equation collating the class to class comparisons, weighted by their probabilities (29:395).

The implementation of these steps in this thesis follows his algorithm closely, but not
exactly. Estimated parameters of the discriminant function, $h(x)$, were calculated from a
distinct training set to maintain independence between training and test sets. Equations which
find intersection points and normal vectors are provided by Lee (29:395). The computer code
in Appendix C implements the steps listed above along with the equations used to find the
intersection in Step 4 and the normal vector in Step 5. Essentially the same results as found
in Lee’s sample problems were produced with this code. The code in Appendix C is set up to
reproduce Lee’s sample problems (Examples 3, 4, and 5 (29:396-397)).
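
The two-class procedure listed above can be sketched compactly. The code below is a simplified
illustration assuming NumPy, not the Appendix C code and not Lee's exact method: the chi-square
outlier tests of steps 1 and 2 are omitted, the boundary crossing is found by bisection rather
than by Lee's closed-form equations, the normal is taken as the normalized gradient of the
Gaussian $h(x)$, and all function names and the synthetic data are hypothetical.

```python
import numpy as np

def make_h(mu1, cov1, mu2, cov2):
    """Return the quadratic Gaussian discriminant h and its gradient."""
    inv1, inv2 = np.linalg.inv(cov1), np.linalg.inv(cov2)
    bias = 0.5 * (np.linalg.slogdet(cov1)[1] - np.linalg.slogdet(cov2)[1])
    def h(x):
        d1, d2 = x - mu1, x - mu2
        return 0.5 * d1 @ inv1 @ d1 - 0.5 * d2 @ inv2 @ d2 + bias
    def grad(x):
        return inv1 @ (x - mu1) - inv2 @ (x - mu2)
    return h, grad

def boundary_point(h, a, b, t=0.0, iters=60):
    """Bisection along the segment a-b, assuming h(a) < t < h(b)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if h(a + mid * (b - a)) < t:
            lo = mid
        else:
            hi = mid
    return a + 0.5 * (lo + hi) * (b - a)

def class_matrix(h, grad, own, other, own_is_class1, t=0.0):
    """Steps 3 to 6: average the outer products of unit normals for one class."""
    total = np.zeros((own.shape[1], own.shape[1]))
    for x in own:
        nn = other[np.argmin(np.sum((other - x) ** 2, axis=1))]  # step 3
        a, b = (x, nn) if own_is_class1 else (nn, x)
        p = boundary_point(h, a, b, t)                           # step 4
        n = grad(p)
        n /= np.linalg.norm(n)                                   # step 5
        total += np.outer(n, n)
    return total / len(own)                                      # step 6

def edbfm(class1, class2, t=0.0):
    mu1, mu2 = class1.mean(axis=0), class2.mean(axis=0)
    cov1, cov2 = np.cov(class1, rowvar=False), np.cov(class2, rowvar=False)
    h, grad = make_h(mu1, cov1, mu2, cov2)
    keep1 = class1[np.array([h(x) < t for x in class1])]  # step 1, correct samples only
    keep2 = class2[np.array([h(x) > t for x in class2])]
    return (class_matrix(h, grad, keep1, keep2, True, t)
            + class_matrix(h, grad, keep2, keep1, False, t))     # steps 7 and 8

# Synthetic problem: the classes differ only along the first feature.
rng = np.random.default_rng(1)
class1 = rng.normal(size=(200, 3))
class2 = rng.normal(size=(200, 3)) + np.array([3.0, 0.0, 0.0])

matrix = edbfm(class1, class2)
eigvals, eigvecs = np.linalg.eigh(matrix)  # ascending eigenvalues
dominant = eigvecs[:, -1]
```

In this synthetic problem the dominant eigenvector points almost entirely along the first
feature, the only direction that separates the classes, and the remaining eigenvalues are small.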

2.4.3 Wavelet Analysis. Wavelet decomposition in the context of a multiresolution
analysis (MRA) is a relatively new technique which has matured especially in the last ten
to fifteen years. The seminal article tying together the mathematics behind using wavelets
in an MRA is Stephane Mallat’s “A Theory for Multiresolution Signal Decomposition: The
Wavelet Representation” (31). The discrete wavelet transform projects a signal into a series
of nested subspaces and their orthogonal complements. The analogy with Fourier series
frequency representations, in particular windowed Fourier transforms, is tempting, but not
quite exact. The wavelet kernel is parameterized with shifts and scale, while the windowed
Fourier analysis is parameterized with shifts and frequency. Instead of frequency as the key
parameter in forming the transformation kernel, scale is the key parameter. With respect to
the UHRR radar problem, wavelet analysis looks to pick out unique scale information (class
to class) from the time-based signature.

The nested subspaces are called approximation levels. Projections into them are analogous
to successive low pass filtering operations. See Mallat on how the filter coefficients
are derived. In many cases, filter coefficients may be chosen tailored to the problem at hand.
The resultant approximation signals have frequency content roughly corresponding to octave
subbands. Referring to Figure 3, the approximations are labelled “A,” “AA,” and “AAA.”
The associated detail spaces contain information “lost” when one projects into an approximation
space. Each approximation is a subspace of the original space and of the space of the
approximation of the previous level.

Projections into the detail spaces result from high pass filtering of the approximation
signals. The filter generating detail signals is derived from the filter generating the
approximations, and the pair may be viewed as quadrature mirror filters (31). Each detail space is
orthogonal to the approximation spaces and detail spaces at its level of scale and below.

During an MRA, filtering is applied recursively to the approximation signal. A direct
sum of an approximation signal and detail signal is employed to reconstruct the approximation
at the next higher level of scale. The intricacies of the mathematics may be found in
Mallat’s article. Note that with each projection, the signal is downsampled by 2. Each set of
approximation and detail coefficients is a representation of the original signal in that subspace.
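
The recursive filtering and downsampling can be illustrated with the simplest orthonormal
filter pair. The sketch below uses the Haar wavelet, which is an assumption made for brevity
(the thesis does not commit to a filter here), and standard-library Python only; the function
names and the eight-sample signal are hypothetical.

```python
import math

def haar_step(signal):
    """One MRA level: low pass and high pass filtering, each downsampled by 2."""
    s = 1.0 / math.sqrt(2.0)
    approx = [s * (signal[i] + signal[i + 1]) for i in range(0, len(signal), 2)]
    detail = [s * (signal[i] - signal[i + 1]) for i in range(0, len(signal), 2)]
    return approx, detail

def haar_inverse_step(approx, detail):
    """Rebuild the approximation at the next higher level from approximation and detail."""
    s = 1.0 / math.sqrt(2.0)
    signal = []
    for a, d in zip(approx, detail):
        signal += [s * (a + d), s * (a - d)]
    return signal

def mra(signal, levels):
    """Recursively decompose the approximation; return the last approximation and all details."""
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details

signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
aaa, details = mra(signal, 3)  # "AAA" plus the details lost at each level

# Reconstruct by adding details back in, coarsest level first.
rebuilt = aaa
for detail in reversed(details):
    rebuilt = haar_inverse_step(rebuilt, detail)

energy_in = sum(v * v for v in signal)
energy_out = sum(v * v for v in aaa) + sum(v * v for d in details for v in d)
```

The coefficient counts halve at each level (4, 2, 1), the signal energy is preserved by the
orthonormal transform, and the original signal is recovered from the coefficients.
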

Figure 3. Original Signal Space Projected with Conventional Multiresolution Analysis


For the purposes of this thesis, wavelets provide a tool to extract relevant scale information
about the targets. Wavelets will be used in the context of finding alternative orthonormal
bases for the space in which the original signal is represented. The wavelet representations of
the original signal are used as inputs to the Gaussian classifier.

Wavelet  packets  are  a  modified  version  of  the  conventional  MRA  in  that  detail  signals 
are  recursively  filtered,  as  well.  As  pointed  out  by  Coifman  (10:714),  a  wavelet  packet  library 
“corresponds  roughly  to  a  covering  of  ‘frequency’  space.”  The  term  “covering”  implies  that 
the  set  of  signals  that  can  be  represented  in  the  original  space  can  also  be  represented  by  the 
bases  generated  through  the  MRA.  The  wavelet  packet  bases  are  organized  into  nodes  of  a 
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binary  tree,  as  shown  in  Figure  4.  In  the  context  of  Section  2.4.1,  representations  of  the  signal 
are  projected  into  subspaces  at  each  level  of  decomposition  in  the  tree.  Instead  of  a  standard 
MRA,  in  which  only  approximation  signals  are  recursively  decomposed,  all  detail  signals  are 
recursively  decomposed  at  all  leaves  in  the  binary  tree.  The  key  in  using  wavelets  instead  of 
a  windowed  Fourier  transform  is  that  many  of  the  detail  spaces  are  orthonormal.  This  implies 
that  the  information  contained  in  a  detail  signal  is  unique  relative  to  other  approximations  and 
details  at  the  same  scale  or  below. 
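
A wavelet packet decomposition differs from the conventional MRA only in that every node is split again. The sketch below builds the full binary tree; the Haar filters and all names are illustrative assumptions rather than the filters used in this research.

```python
import numpy as np

def haar_split(x):
    """One filtering step: (approximation, detail), each half the input length."""
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)

def wavelet_packet_tree(signal, levels):
    """Return a list of tree levels; level k holds 2**k nodes."""
    tree = [[np.asarray(signal, dtype=float)]]
    for _ in range(levels):
        children = []
        for node in tree[-1]:
            approx, detail = haar_split(node)   # details are decomposed too
            children.extend([approx, detail])
        tree.append(children)
    return tree

x = np.random.default_rng(0).standard_normal(256)
tree = wavelet_packet_tree(x, 3)
energy_in = float(np.sum(x ** 2))
energy_leaves = float(sum(np.sum(node ** 2) for node in tree[3]))
```

Because each split in this sketch is orthonormal, the leaves at any level carry the same total energy as the original signal.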

Figure 4. Original Signal Space Projected into Wavelet Packet Subspaces


For feature discrimination, the idea is to find those scales which contain the relevant
information for classification. In the Coifman article, as implied by its title, “Entropy-Based
Algorithms for Best Basis Selection,” the premise is that entropy is an appropriate criterion for
determining the best, minimal set of bases to represent a signal. This concept was applied
by Chang and Kuo (6) with respect to identifying textured images. Their goal is to find the
minimum number of features sufficient for classification (6:429). However, the entropy
measure they use is a tool designed to give a minimal representation of the original
signal; it is not specifically designed to minimize classification error. Recently, Coifman
has developed a technique which selects scales based on classification error (9).

Chang’s approach is to recognize that certain textures have little or no frequency information
in certain bands. They “detect the significant frequency channels” by employing an
averaged $l_1$ norm (6),

$$ e = \frac{1}{N} \sum_{i=1}^{N} |x_i| \tag{41} $$

where x is the vector of features, of length $N$, at a given approximation or detail level.

The criterion for retaining a given decomposition level, or subspace, is to compare
its energy to the largest energy value at that scale. Chang states that if $e < C \cdot e_{max}$, then
decomposition stops at that level. In other words, this scale is relatively unimportant to the
representation of the original signal. Chang goes on to lay out a training and classification
scheme using the $l_1$ norms as features.
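
As a small worked sketch of this criterion, the code below computes the averaged l1 norm of each node at one scale and flags those that pass the threshold. The node values and the constant C = 0.3 are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def averaged_l1(x):
    """Averaged l1 norm of a coefficient vector."""
    return float(np.mean(np.abs(x)))

def significant_nodes(nodes, C):
    """Flag nodes whose energy is at least C times the largest energy at this scale."""
    energies = np.array([averaged_l1(node) for node in nodes])
    keep = energies >= C * energies.max()
    return energies, keep

# Three hypothetical nodes at one scale.
nodes = [
    np.array([4.0, -4.0, 4.0, -4.0]),
    np.array([0.1, -0.1, 0.1, 0.1]),
    np.array([2.0, 2.0, -2.0, 2.0]),
]
energies, keep = significant_nodes(nodes, C=0.3)
```

Here the energies are 4.0, 0.1, and 2.0, so with the threshold at 0.3 × 4.0 = 1.2 the second node would not be decomposed further.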

The  essence  of  Chang’s  procedure  is  to  calculate  the  “energy  map”  of  a  given  texture  or 
class  during  training.  Complete  tree-structured  wavelet  packet  bases  are  computed  for  each 
training sample and $l_1$ norms are calculated for each set of approximations and details at each
scale.  All  of  these  values  are  retained,  no  matter  what  the  relative  magnitudes  of  the  energies 
may  be.  When  a  test  vector  is  presented  to  the  classifier,  it  is  decomposed  in  the  same  way. 
This  time,  only  those  levels  with  significant  energies  are  retained  as  features.  These  features 
are  then  compared  to  the  corresponding  features  in  each  template  and  one’s  favorite  measure 
of  similarity  (such  as  Equation  24)  is  applied. 

Wavelets  have  great  potential  as  a  frequency  analyzer,  but  one  disadvantage  is  their 
shift  variance.  Because  alignment  and  registration  are  critical  with  respect  to  the  UHRR  radar 
data,  a  way  to  address  this  issue  must  be  found.  A  solution  utilized  in  this  thesis  was  presented 
by  Suzuki  in  a  report  written  as  part  of  her  PhD  minor  examination  (46).  In  this  technique, 
instead  of  downsampling,  all  values  from  a  given  filtering  operation  are  retained  at  each 
level  of  decomposition.  Thus,  for  a  256-bin  radar  signature,  all  approximations  and  details 
would also contain 256 bins. The shift variance problem is solved because no downsampling
step removes every other range bin from consideration (46:9). Other solutions do exist,
including an article by DelMarco and Weiss (12) which applies wavelet packets and a shift
invariant  wavelet  transform  with  damped  sinusoids  as  test  vectors. 
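
The effect of skipping the downsampling step can be illustrated with a generic undecimated filter bank. This is not Suzuki's implementation; the Haar-style averaging and differencing, the circular extension, and all names are assumptions made for the sketch.

```python
import numpy as np

def undecimated_haar(signal, levels):
    """Filter without downsampling; every output keeps the input length."""
    approx = np.asarray(signal, dtype=float)
    approximations, details = [], []
    for level in range(levels):
        # Pair each bin with its circular neighbour 2**level bins away.
        partner = np.roll(approx, -(2 ** level))
        detail = (approx - partner) / 2.0
        approx = (approx + partner) / 2.0
        approximations.append(approx)
        details.append(detail)
    return approximations, details

x = np.random.default_rng(1).standard_normal(256)
_, d_ref = undecimated_haar(x, 3)
_, d_shift = undecimated_haar(np.roll(x, 5), 3)
```

Shifting the input by five bins simply shifts every set of coefficients by five bins, which is the property that makes the representation usable when registration is uncertain.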

One  other  important  aspect  of  the  multiresolution  analysis  must  be  addressed  in  the 
application  of  wavelets:  edge  effects.  These  will  have  a  drastic  impact  on  the  features  collected, 
especially  as  one  progressively  filters  through  each  level  of  decomposition.  Adaptive  wavelets 
have  been  developed  by  Cohen,  Daubechies,  and  Vial  to  alleviate  the  problem  (8).  These 
authors  indicate  more  ad  hoc  techniques  also  exist  which  are  easier  to  implement.  Some 
examples  are:  periodically  extending  the  signal  and  reflecting  the  signal  about  its  endpoint 
to  form  a  signal  twice  as  long.  Circular  convolution  during  filtering  may  be  employed  to 
invoke  the  periodic  extension.  Sometimes,  if  the  signal  trails  off  to  zero  at  the  endpoints,  zero 
padding  is  effective.  Finding  an  effective  strategy  is  often  problem  dependent. 
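
The ad hoc extensions mentioned above map onto standard array padding modes. The sketch below uses NumPy's `numpy.pad`; the strategy names and the helper functions are illustrative assumptions, and the reflection shown pads both ends rather than doubling the signal length.

```python
import numpy as np

def extend(signal, pad, strategy):
    """Pad both ends of a signal with one of three simple edge strategies."""
    modes = {"periodic": "wrap", "reflect": "symmetric", "zero": "constant"}
    return np.pad(np.asarray(signal, dtype=float), pad, mode=modes[strategy])

def circular_filter(signal, taps):
    """Circular convolution: y[n] = sum_k taps[k] * x[(n - k) mod N]."""
    x = np.asarray(signal, dtype=float)
    return sum(h * np.roll(x, k) for k, h in enumerate(taps))

x = np.array([1.0, 2.0, 3.0, 4.0])
periodic = extend(x, 2, "periodic")
reflected = extend(x, 2, "reflect")
zero_padded = extend(x, 2, "zero")

taps = [0.5, 0.5]
circular = circular_filter(x, taps)
```

Circular filtering gives the same result as filtering the periodically extended signal and keeping one period, so no explicit padding is needed in that case.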

Other researchers have used MRAs and wavelet packets for signal processing and, in
particular,  UHRR  radar  returns.  A  review  of  recent  advances  with  respect  to  wavelet  pattern 
recognizers  is  given  by  Benveniste  and  others  (4).  Chou  and  others  use  the  multiscale  analysis 
approach,  noting  that  branches  in  the  dyadic  trees  correspond  to  scale  representations  of  a 
time  signal  (7).  Baras  and  Wolk  present  a  method  for  on-line  automatic  target  recognition 
systems  (3).  Their  proposed  algorithms  include  vector  quantization  via  aspect  graphs  and 
vector  quantization  techniques  via  the  Linde-Buzo-Gray  algorithm  for  clustering.  Baras  cites 
Gersho and Gray with respect to the vector quantization techniques (22).

The work most similar to this thesis is a July 1994 article by Coifman and Saito (9),
which extends Coifman’s 1992 article on entropy analysis of wavelet bases for
minimum reconstruction error. The July 1994 article addresses classification error and
proposes two techniques for feature selection from the wavelet packet bases.
One technique is to use a Fisher discriminant on the wavelet packet bases. The other technique
creates “an adaptive orthonormal basis [at each leaf in the tree] which minimizes a measure
of the prediction error (such as error) for the regression problem” (9:194). The analysis in
that article is restricted to synthetic data.

2.5 The Adaptive Gaussian Classifier (AGC)

The  AGC  takes  UHRR  radar  signatures,  preprocesses  them  and  applies  a  Gaussian 
discriminant.  The  purpose  of  this  section  is  to  describe  the  algorithm  outlined  in  Figure  5. 
This  research  makes  an  explicit  distinction  between  Gaussian  discriminants  and  the  AGC. 
The  term  AGC  always  includes  the  entire  preprocessing  and  alignment  scheme  as  developed 
by  Hughes  Aircraft  Corporation.  The  overall  methodology  is  to  first  create  templates  from 
training  data  and  then  compare  test  signatures  against  them,  as  described  earlier  for  the  general 
pattern  recognition  problem.  Each  return  in  a  range  bin  is  treated  as  a  dimension  in  the  feature 
space.  A  template,  as  generated  by  the  AGC,  holds  mean  and  variance  information  for  each 
of  the  range  bins  for  a  given  target  class.  An  incoming,  unknown  signature  is  compared  and 
aligned  against  class  templates  to  find  its  “closest”  match  for  classification. 

2.5.1  The  Data.  UHRR  radar  waveforms  are  chirp  radar  returns  from  aircraft.  Radar 
energy  received  from  a  target  is  demodulated  with  a  linear  frequency  ramp  function.  Energy 
from  scatterers  at  a  given  range  is  put  into  a  given  range  bin.  Each  range  bin  contains  energy 
corresponding  to  an  incremental  distance  from  the  radar  receiver.  Radar  energy  corresponding 
to  the  nose  of  the  aircraft  is  located  in  the  lower-indexed  range  bins  and  energies  from  trailing 
edges  are  found  in  higher-indexed  range  bins.  An  important  problem  in  this  radar  technique  is 
a  wrap  around  effect  which  occurs  because  of  the  range  uncertainty  imposed  by  the  acquisition 
and  demodulation  process  of  the  underlying  chirp  radar  system.  Details  of  chirp  radar  and 
radar  stretching  may  be  found  in  Stimson  (44:217-223). 

In this analysis, an unprocessed UHRR radar signature consists of 812 range bins. Each
range  bin  has  an  in-phase  and  quadrature  component.  During  acquisition,  the  signals  have 
been  oversampled  by  a  factor  of  eight.  Before  classification  or  training,  each  radar  signature 
is  preprocessed.  The  specifics  of  ARTI  Phase  III  data  collection  may  be  found  in  (1). 

2.5.2 Preprocessing and Coarse Alignment. The preprocessing sequence is shown
in Figure 6. The overall purpose of preprocessing is to normalize the data, correct the wrap-around
problem, and align the data. Alignment occurs in a two-step process consisting of
coarse alignment and fine alignment. Technically, coarse alignment occurs in the preprocessing
section of the AGC computer code. Fine alignment occurs as a separate step during template
construction or classification.

Figure 5. AGC Overview. Training Procedure: Read in One Training Signature; Preprocessing (Section 2.5.2); Coarse Alignment (Section 2.5.2.2); Compute Correspondence to Current Template (Sections 2.5.3, 2.5.4); Update Mean and Variance Templates (Section 2.5.4). Testing Procedure: Read in One Test Signature; Preprocessing (Section 2.5.2); Coarse Alignment (Section 2.5.2.2); Compute Correspondence to Each Template (Section 2.5.5); Compute Probabilities for Each $\omega_i$ (Section 2.5.5).

2.5.2.1 Magnitude and Downsample. Let the stored UHRR radar signal
be represented by the vector, v, with the complex value of the return in the $i$th range bin
represented by $v[i]$. Each range bin contains in-phase and quadrature phase components $v_I[i]$
and $v_Q[i]$. The first step in preprocessing is to take a simple magnitude for all $i$:

$$ v_{mag}[i] = \sqrt{v_I^2[i] + v_Q^2[i]}. \tag{42} $$

The resulting signature vector is downsampled by three, after deleting the first 22 bins
and final 22 bins, to produce a new signal vector, a. The assumption here is that the initial
range bins are instrument noise associated with the acquisition process. The new signal vector
a is of length 256.
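
The bin counts in this step can be checked with a short sketch: 812 bins minus 22 at each end leaves 768, and keeping every third bin leaves 256. The code assumes the magnitude is the usual square root of the summed squares of the in-phase and quadrature components, and that downsampling simply keeps every third sample with no additional filtering; the function and variable names are hypothetical.

```python
import numpy as np

def magnitude_and_downsample(v_i, v_q, trim=22, factor=3):
    """Take the magnitude, drop the end bins, and keep every third bin."""
    v_mag = np.hypot(v_i, v_q)                 # magnitude of each complex return
    trimmed = v_mag[trim:len(v_mag) - trim]    # delete first and final 22 bins
    return trimmed[::factor]                   # downsample by three

rng = np.random.default_rng(2)
v_i, v_q = rng.standard_normal(812), rng.standard_normal(812)
a = magnitude_and_downsample(v_i, v_q)
```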

2.5.2.2  Circular  Centroid.  A  circular  centroid  (CC)  performs  the  coarse 
alignment  on  a  and  solves  the  wrap-around  problem.  The  CC  finds  the  power  centroid  of  a 
signal  when  the  starting  and  ending  points  of  the  signal  are  assumed  unknown.  This  coarse 
alignment  of  a  signature  is  designed  to  get  a  given  signature  within  ±18  range  bins  of  proper 
alignment  with  its  class  template.  The  number  18  was  chosen  as  the  range  of  shifts  in 
experiments  performed  by  DeWall  (14).  This  limit  on  the  window  helps  prevent  confusion 
during  testing  because  exemplars  are  less  likely  to  be  matched  against  the  wrong  template. 

Mathematically, the CC operates on the signal vector a as follows:

$$ y_{wgt} = \sum_{i=1}^{256} a[i] \cdot \sin(i\Omega), \tag{43} $$

$$ x_{wgt} = \sum_{i=1}^{256} a[i] \cdot \cos(i\Omega), \tag{44} $$

where $\Omega = 2\pi/256$, and $y_{wgt}$ and $x_{wgt}$ represent the concentration of power along each axis in the
xy-plane. These coordinates correspond to a point, $\beta$, as shown in Figure 7. The terms given
in Equations 43 and 44 correspond to the imaginary and real components of the fundamental
frequency in Fourier analysis. Thus, in a sense, the CC is a measurement of the energy content
of the signature in a single frequency.


Figure 7. Projection onto Unit Circle


In Figure 7, to find the angle, $\theta$, corresponding to $\beta$, take the arctangent of its components:

$$ \theta = \arctan\left(\frac{y_{wgt}}{x_{wgt}}\right). \tag{45} $$

The center bin index, C, is found by converting back from radians to bin index number:

$$ C = \frac{\theta}{\Omega}. \tag{46} $$


Once  this  central  bin  index  is  found,  it  is  a  simple  matter  to  rearrange  the  signal  vector  into  a 
signal  vector,  b,  whose  power  centroid  is  its  center  bin. 

The  coarse  alignment  sequence  is  diagrammed  in  Figure  8.  CC  takes  the  signal  vector 
and  “wraps”  it  into  a  circle  in  2-space  with  x  and  y  components.  Relative  power  strengths  in 
the  directions  of  the  x  and  y  axes  are  found  and  the  coordinates  of  this  point  give  the  direction 
of  the  circular  power  centroid  from  the  origin.  The  signal  is  then  adjusted  and  “unwrapped” 
to  complete  the  sequence. 
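
The coarse alignment sequence can be sketched as follows. Several details are assumptions made for illustration: the angular step is taken as 2π divided by the number of bins, a four-quadrant arctangent (`numpy.arctan2`) resolves the quadrant, and the signal is rearranged by a circular shift that places the centroid at bin 128. The function name and test signals are hypothetical.

```python
import numpy as np

def circular_centroid_align(a):
    """Wrap the signal onto a circle, find its power centroid, and re-centre it."""
    a = np.asarray(a, dtype=float)
    n = len(a)
    omega = 2.0 * np.pi / n
    i = np.arange(1, n + 1)                      # 1-based bin indices
    y_wgt = np.sum(a * np.sin(i * omega))
    x_wgt = np.sum(a * np.cos(i * omega))
    theta = np.arctan2(y_wgt, x_wgt) % (2.0 * np.pi)
    c = theta / omega                            # centroid in bin units
    shift = int(round(n / 2 - c))                # move centroid to the centre bin
    return np.roll(a, shift), c

# A single scatterer in bin 40.
spike = np.zeros(256)
spike[39] = 1.0
b_spike, c_spike = circular_centroid_align(spike)

# A return that wraps around the ends of the record, centred on bin 1.
wrapped = np.zeros(256)
wrapped[[254, 255, 0, 1, 2]] = [1.0, 2.0, 3.0, 2.0, 1.0]
b_wrapped, c_wrapped = circular_centroid_align(wrapped)
```

In the second case the return is split across both ends of the record, yet the centroid is still found at bin 1 and the rearranged signal is contiguous about the center.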


Figure 8. Coarse Alignment Process (panels: Incoming Signature; Projection onto Unit Circle; horizontal axis: Range Bin)

The  disadvantage  of  the  coarse  alignment  technique  is  that  abnormal  flashes  in  the 
return  of  one  exemplar  of  a  given  aircraft  may  cause  improper  alignment.  These  flashes  can 
wash out information and cause an exemplar to appear very different from its class template,
especially in the sense of alignment. Such a flash problem is illustrated in Figure 9. This
figure shows two signatures from one class. The signal with the flash is no longer similar
to other members of its class. The noise flashes will affect the coarse alignment procedure
and  judgments  of  similarity  during  classification. 


Figure 9. Left: Class f2, Signal “A,” with Noise Flash; Right: Class f2, Typical Signal


2.5.2.3 Power Normalization and Transformation. Each signal is energy
normalized, as would be expected in any algorithm using correlations to align signatures
against templates. Mathematically, compute the energy normalized version of the signals:

$$ p = \sqrt{\frac{\sum_{i=1}^{256} b^2[i]}{256}}, \tag{47} $$

$$ c[i] = \frac{b[i]}{p}. \tag{48} $$


The  power  transformation  shown  in  Figure  6  makes  the  data  more  “Gaussian-like”. 
The  underlying  distribution  of  the  UHRR  radar  data  is  Rician  (34),  but  the  AGC  inherently 
assumes  a  Gaussian  pdf.  The  intention  is  to  make  the  underlying  UHRR  radar  pdf  more 
closely  appear  Gaussian.  This  is  a  random  variable  transformation  described  in  Fukunaga 
(19:76). The transformation is:


$$ d[i] = (c[i])^{\nu}. \tag{49} $$

The  radar  signature  ready  for  training  or  testing  is  designated  by  the  vector  d.  This  vector  is 
incorporated  into  a  template  during  training  or  tested  against  class  templates  during  testing. 
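
A minimal sketch of the normalization and power transformation follows. It assumes that the normalization divides by the root-mean-square level of the centered signature and that the transformation raises each bin to a fixed exponent. The exponent is left as a parameter, `nu`, and the value 0.5 used below is only a placeholder, not the value used by the AGC.

```python
import numpy as np

def normalize_and_transform(b, nu):
    """Energy-normalize a signature, then apply a power transformation."""
    b = np.asarray(b, dtype=float)
    p = np.sqrt(np.mean(b ** 2))   # root-mean-square level of the signature
    c = b / p                      # energy-normalized signature
    d = c ** nu                    # power transformation toward a Gaussian shape
    return c, d

# Hypothetical non-negative magnitudes standing in for a centered signature.
b = np.abs(np.random.default_rng(3).standard_normal(256))
c, d = normalize_and_transform(b, nu=0.5)   # placeholder exponent
```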

2.5.3  Training.  The  user  of  the  AGC  may  choose  how  many  of  the  256  remaining 
range  bins  to  use  for  training  and  testing.  The  algorithm  leaves  off  bins  symmetrically  from 
each  end  of  the  signature.  The  default  value  is  set  at  192  range  bins.  Exemplars,  denoted  by 
d  again,  form  192-dimensional  templates,  where  each  range  bin  is  a  feature.  The  template 
file  holds  mean  and  variance  information  for  each  range  bin.  Template  calculations  are 
made  sequentially,  as  each  d  is  presented  after  preprocessing.  Features  are  assumed  to  be 
uncorrelated. 

Here, $\hat{\mu}_k[i]$ and $\hat{\sigma}_k^2[i]$ represent the mean and variance for the $i$th range bin after $k$ signatures
have been presented for training. The equations implemented in the AGC’s computer
code are straightforward and follow from the theory presented earlier. The initial values are
set from the first exemplar, $d_1[i]$:

$$ \hat{\mu}_1[i] = d_1[i] \tag{50} $$

$$ \hat{\sigma}_1^2[i] = 0. \tag{51} $$

For the $k$th exemplar,

$$ \hat{\mu}_k[i] = \frac{(k-1) \cdot \hat{\mu}_{k-1}[i] + d_k[i]}{k} \tag{52} $$

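$$ \hat{\sigma}_k^2[i] = \frac{k-1}{k} \left( \hat{\sigma}_{k-1}^2[i] + \frac{(d_k[i] - \hat{\mu}_{k-1}[i])^2}{k} \right). \tag{53} $$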

The  latter  equations,  which  recursively  calculate  mean  and  variance  estimates  are  given  by 
Schalkoff  (41:65). 
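
A sequential template update of this kind can be sketched as below. The variance recursion shown is the population (divide by k) form, which is an assumption; the class name and test data are hypothetical. The sketch omits the fine alignment step and only illustrates that the recursive estimates agree with the batch estimates.

```python
import numpy as np

class RunningTemplate:
    """Per-bin mean and variance, updated one exemplar at a time."""

    def __init__(self, first_exemplar):
        self.k = 1
        self.mean = np.array(first_exemplar, dtype=float)   # mean starts at d_1
        self.var = np.zeros_like(self.mean)                  # variance starts at 0

    def update(self, d):
        self.k += 1
        k = self.k
        delta = np.asarray(d, dtype=float) - self.mean       # d_k - previous mean
        self.mean = self.mean + delta / k                    # same as ((k-1)*mean + d_k)/k
        self.var = (k - 1) / k * (self.var + delta ** 2 / k)

exemplars = np.random.default_rng(4).standard_normal((50, 8))
template = RunningTemplate(exemplars[0])
for d in exemplars[1:]:
    template.update(d)
```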

2.5.4 Adaptation. In the context of this thesis, the term “adaptive” only describes
the process of shifting the incoming signal to the developing template, which is formed
iteratively as each exemplar is read during training. There exists code at WL/AARA which
incorporates “adaptation” in another sense. In that case, a compensation parameter is added
to the routine which accounts for changes in magnitude of the incoming signal (34). Thus,
data taken at different ranges may be compared even though data from nearer ranges may
have higher magnitudes than fainter returns. In the computer code itself, mean and variance
calculations are made after finely aligning the $k$th exemplar with the $(k-1)$th mean template.
For training, alignment is found in the sense of a minimum Euclidean distance.

The  exemplar  is  shifted  until  the  minimum  distance  is  found  between  the  exemplar  and 
mean  template.  Shifts  are  made  up  to  eighteen  bins  in  either  direction,  relying  on  the  premise 
that  the  CC  has  already  coarsely  aligned  the  training  signature  to  the  class  mean  template. 
When  the  CC  fails  to  place  the  “true”  centroid  within  18  range  bins  of  its  class  mean,  this 
fine  alignment  step  fails.  Shifts  are  made  circularly:  in  other  words,  bins  shifted  off  one  end 
reappear  at  the  other. 

The mathematics are as follows, with $D_k[j]$ representing the squared Euclidean distance
for the $k$th exemplar offset by $j$:

$$ D_k[j] = \sum_{i=1}^{256} \left( d_k[(i + j) \oplus 256] - \hat{\mu}_{k-1}[i] \right)^2 \tag{54} $$

$$ j \in [-18, 18], \quad (j \text{ an integer}). \tag{55} $$

The $\oplus$ symbol represents the modulo operator and implements the circular shifting of the
exemplar, d. From the above equation, the shift $j_{best}$ is chosen that minimizes $D_k$. Then,
instead of Equation 52,

$$ \hat{\mu}_k[i] = \frac{(k-1) \cdot \hat{\mu}_{k-1}[i] + d_k[(i + j_{best}) \oplus 256]}{k} \tag{56} $$

is used. This equation uses the best matched version of $d_k$ in a Euclidean distance sense. The
calculation for $\hat{\sigma}_k^2[i]$ remains the same.
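
The fine alignment search can be sketched as a loop over the 37 candidate circular shifts. The code uses 0-based indexing and hypothetical names; it is an illustration of the search, not the AGC code itself.

```python
import numpy as np

def fine_align(d, mean_template, max_shift=18):
    """Find the circular shift within +/- max_shift that minimizes squared distance."""
    d = np.asarray(d, dtype=float)
    best_j, best_dist = 0, np.inf
    for j in range(-max_shift, max_shift + 1):
        shifted = np.roll(d, -j)                 # shifted[i] = d[(i + j) mod n]
        dist = float(np.sum((shifted - mean_template) ** 2))
        if dist < best_dist:
            best_j, best_dist = j, dist
    return best_j, best_dist, np.roll(d, -best_j)

template = np.random.default_rng(5).standard_normal(256)
exemplar = np.roll(template, 7)                  # misregistered by 7 bins
j_best, dist, aligned = fine_align(exemplar, template)

# A 30-bin offset lies outside the search window and cannot be recovered.
_, dist_outside, _ = fine_align(np.roll(template, 30), template)
```

The second call illustrates why the coarse alignment must already place the signature within the ±18 window: no shift in the search range brings the distance near zero.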

The above steps are repeated for as many exemplars as the researcher wishes to use to
form  the  class  template.  The  process  is  executed  separately  for  each  target  class,  and  presumes 
the  use  of  labelled  data.  Templates  generated  during  training  are  placed  in  separate  target  files. 
For  testing,  individual  templates  are  concatenated  into  a  single  pattern  file. 

2.5.5  Testing.  Noting  Figure  5,  for  each  unknown  test  signature  presented  to 
the classifier, the preprocessing steps outlined in Section 2.5.2 are performed to generate a
normalized  and  coarsely  aligned  feature  vector. 

The  testing  algorithm  computes  Mahalanobis  distances  between  the  unknown  vector, 
d,  and  each  class  template  vector  contained  in  the  pattern  file.  As  in  training,  the  test  vector 
is  circularly  shifted  against  each  class  template  for  fine  alignment  and  the  best  (minimum) 
value is stored for each class. The AGC implements a portion of the discriminant equation,
Equation 31. Only half the terms are used, as shown here, for a comparison with the $\omega_i$
template:

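$$ md(\mathbf{x}) = \sum \left[ (\mathbf{x} - \hat{\boldsymbol{\mu}}_i) \mathbin{.*} (\mathbf{x} - \hat{\boldsymbol{\mu}}_i) \mathbin{./} \hat{\boldsymbol{\sigma}}_i^2 \right] \tag{57} $$

where $\hat{\boldsymbol{\sigma}}_i^2$ is a row vector containing estimated variances for class $\omega_i$. The “.*” operator indicates
element by element multiplication and the “./” operator indicates element by element division.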
The  equation  has  simplified  because  of  the  assumption  of  uncorrelated  features  and  because 
one  is  calculating  the  a  posteriori  probability  for  a  specific  class.  The  linear  algebra  works 
out  the  same  as  with  a  full  covariance  matrix,  but  computation  time  is  saved.  It  is  at  this  point 
that  two  important  assumptions  in  this  classifier  become  evident.  Features  are  assumed  to  be 
uncorrelated  (34)  and  the  decision  boundaries  are  assumed  to  be  quadratic.  That  is,  covariance 
matrices  are  estimated  for  each  class  and  they  are  non-zero  only  on  the  main  diagonal. 

The testing process is the same as training, except that the Mahalanobis distance is
used plus the bias term, for example, $B_1$ for class 1. The bias term for class 1, meaning the
determinant of the covariance matrix, simplifies to

$$ |\Sigma_1| = \prod_{i=1}^{256} \hat{\sigma}_1^2[i] \tag{58} $$

$$ B_1 = \sum_{i=1}^{256} \ln \hat{\sigma}_1^2[i]. \tag{59} $$

For a given test signature, $p(\omega_i \mid \mathbf{x})$ is calculated for each template and the values
are put in a matrix and saved to an output file. This file contains K columns with respective
classification distances to each class template. The user must use his own algorithm to translate
the results to a confusion matrix.
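
The test statistic for one class can be sketched as a diagonal-covariance Mahalanobis distance plus the log-variance bias term. The sketch below leaves out the fine alignment against each template, uses hypothetical four-bin templates, and cross-checks the shortcut against the full-covariance form using `numpy.linalg.inv` and `numpy.linalg.slogdet`.

```python
import numpy as np

def class_statistic(x, mean, var):
    """Mahalanobis distance with uncorrelated features, plus the bias term."""
    md = np.sum((x - mean) * (x - mean) / var)   # element-by-element operations
    bias = np.sum(np.log(var))                   # log-determinant of a diagonal matrix
    return md + bias

def classify(x, means, variances):
    """Return the index of the smallest statistic and all statistics."""
    stats = np.array([class_statistic(x, m, v) for m, v in zip(means, variances)])
    return int(np.argmin(stats)), stats

# Two hypothetical class templates with four range bins each.
means = [np.zeros(4), np.full(4, 3.0)]
variances = [np.ones(4), np.full(4, 2.0)]
x = np.array([0.1, -0.2, 0.3, 0.0])
winner, stats = classify(x, means, variances)

# Cross-check the second class against the full-covariance form.
cov = np.diag(variances[1])
diff = x - means[1]
full = diff @ np.linalg.inv(cov) @ diff + np.linalg.slogdet(cov)[1]
```

The element-by-element form gives the same value as the full matrix expression while avoiding a matrix inverse, which is the saving in computation time noted above.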

2.6  Summary 

This  chapter  provides  the  tools  used  in  the  remainder  of  this  thesis.  The  statistical  theory 
is  given  which  allows  for  an  evaluation  of  the  AGC  by  estimating  the  AGC’s  classification 
rate.  Methods  for  understanding  the  data  with  respect  to  separability  issues  and  feature 
discrimination  are  provided.  The  effects  of  assuming  that  the  underlying  distributions  are 
Gaussian  may  now  be  evaluated.  Feature  discrimination  using  the  technique  given  by  Lee  and 
Landgrebe  is  applied  and  extended.  Also,  the  groundwork  for  using  a  multiresolution  analysis 
and discriminating wavelet features is provided. The next chapter explains the methodology
used  to  meet  the  thesis’  goals. 

III. Evaluation Methodology


3.1  Introduction 

The  purpose  of  these  experiments  is  to  evaluate  the  impact  of  alignment  and  radar 
signatures  with  flashes  on  the  AGC.  These  tests  attempt  to  determine  whether  there  is  a 
statistically  significant  problem  with  respect  to  alignment.  The  AGC  is  a  simple  Gaussian 
classifier  with  a  couple  of  twists.  The  first  twist  is  the  alignment  procedure,  which  is  divided 
into  two  parts.  The  first  part  is  the  circular  centroid  technique  described  in  Chapter  2  to 
“coarsely  align”  data.  The  other  part  is  the  fine  alignment  procedure,  which  is  an  integral  part 
of  the  final  decision  making  process.  This  is  because  each  test  signature  is  ideally  aligned 
against  each  class  template  independently  and  then  the  best  “match”  among  the  templates  is 
declared  the  winner.  This  process  is  known  as  establishing  proper  correspondence  between 
an  exemplar  and  template.  The  second  twist  in  the  AGC  is  to  calculate  the  mean  and  variance 
of  the  features  recursively  during  training.  After  being  circularly  centroided,  each  training 
signature  is  finely  aligned  against  the  cumulative  mean  template  and  is  then  added  to  the  mean 
and  variance  templates  proportionally.  This  twist  may  have  an  impact  on  results  because 
irregular  signatures  presented  to  the  AGC  early  in  the  training  process  may  adversely  affect 
performance.  One  way  to  combat  this  problem  is  to  repeat  trials  many  times  and  randomize 
presentation  order. 

Three  potential  problems  are  evident  with  the  implementation  of  the  AGC.  One  is  its 
ability  to  perform  the  coarse  alignment.  The  alignment  problem  has  been  suspected  as  the 
source  of  most  errors  in  UHRR  classification  (34).  Noise  flashes  can  cause  the  centroid  of  an 
exemplar  to  be  far  from  the  average  centroid  of  others  in  its  class.  The  circular  centroid  must 
adjust  the  signal  to  within  ±18  range  bins  to  have  a  chance  for  correct  registration  with  its 
actual  class  template  during  the  fine  alignment  stage  in  testing.  During  AGC  evaluation,  this 
problem  is  addressed  by  removing  problem  signatures  (signatures  that  are  coarsely  aligned 
such  that  the  centroid  is  not  inside  the  ±18  window  as  required  by  the  fine  alignment  process) 
and  running  the  AGC  with  and  without  these  signatures. 

Another problem relates to the assumption of Gaussian pdf’s. Not only does the AGC
make  an  important  assumption  about  the  underlying  statistics  of  the  data,  but  as  a  quadratic 
classifier,  it  places  severe  conditions  on  the  classifier  because  of  the  effective  number  of 
features  which  are  used.  During  AGC  evaluation,  this  problem  is  addressed  by  giving  the 
same data sets to non-parametric classifiers (k-NN and neural network). In this case, the
alignment problem is not addressed; only the ability of the AGC is compared to other
classifiers.

The  third  potential  problem  rests  with  the  data  itself.  When  a  signature  is  aligned  poorly 
by  the  circular  centroid,  is  there  something  inconsistent  in  that  sample  with  respect  to  other 
samples  of  its  class?  This  question  has  to  do  with  the  underlying  information  content  of  the 
signatures.  During  AGC  evaluation,  this  problem  is  addressed  by  quantifying  the  underlying 
information  content  of  the  data  by  using  a  Bayes  bounding  experiment  implemented  by 
Martin  (33). 

The  last  portion  of  the  chapter  explores  the  feature  discrimination  technique  developed 
by  Lee  (29).  This  technique  seeks  to  find  the  most  discriminantly  relevant  features.  This 
research  explores  the  AGC’s  performance  with  respect  to  these  questions  with  computer 
resources  available  at  AFIT.  In  Appendix  B,  the  results  for  two  alternative  training  schemes 
are  given  which  yield  moderately  better  results  with  respect  to  the  alignment  challenge. 

3.2  Data  Preparation 

Data  is  parsed  into  three  sets:  “arbitrary  data,”  “hand-aligned  data,”  and  “pristine  data.” 
The  total  number  of  signatures  retained  in  each  case  is  shown  in  Table  1.  The  criteria  for 
separating  the  data  are  listed  below. 

arbitrary  data  This  case  includes  all  data  from  the  original  target  set. 

hand-aligned data This case includes all data from the original target set. Each signature is
visually checked after the circular centroid for proper alignment. If a signature does not
fall within the ±18 window after coarse alignment, then it is designated as an improperly
aligned signature. Improperly aligned signatures only, referred to as “bad” signatures,
are forced to fall within the ±18 window by adjusting the position of the signal by
hand. This was accomplished with an interactive algorithm developed for WL/AARA
by Veda, Inc. When this data is submitted to the AGC for training or classification, the
circular centroid subroutine is turned off to retain the human aligned positionings of the
bad signatures.

pristine data This case eliminates all signatures designated as bad from the database. The
pristine case retains only those signatures which the CC subroutine successfully placed
within the ±18 window. Class f2 has the greatest number of problem signatures, while
class aa has only two problem signatures.

Table 1. Number of Signatures in Data Sets

Class    Pristine    Hand Aligned    Arbitrary
9a       986         1008            1008
aa       1006        1008            1008
f2       591         1031            1031
fa       998         1008            1008


3.3  Impact  of  Alignment  on  Classification 

3.3.1  Test  Criteria.  For  each  data  set  case  (pristine,  hand-aligned,  arbitrary),  the 
AGC  is  used  in  a  four  class  environment.  The  performance  metric  used  is  a  confusion  matrix 
and the overall probability of correct classification $P_{cc}$ is computed by averaging individual
class classifications (19:66). Multiple trials bound $P_{cc}$ with 97.5% confidence intervals. As
stated  earlier,  these  tests  attempt  to  determine  whether  there  is  a  statistically  significant  problem 
with  respect  to  alignment. 
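
The performance metric can be illustrated with a small sketch that builds a confusion matrix and averages the per-class correct classification rates. The two-class labels below are hypothetical and chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, n_classes):
    """Rows are true classes, columns are assigned classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        cm[t, p] += 1
    return cm

def overall_pcc(cm):
    """Average the individual class rates rather than pooling all test vectors."""
    per_class = np.diag(cm) / cm.sum(axis=1)
    return float(per_class.mean()), per_class

true_labels = [0, 0, 0, 0, 1, 1]
predicted_labels = [0, 0, 0, 1, 1, 1]
cm = confusion_matrix(true_labels, predicted_labels, n_classes=2)
pcc, per_class = overall_pcc(cm)
```

Here the class rates are 0.75 and 1.0, giving an average of 0.875. Pooling all six test vectors would instead give 5/6, so the two conventions differ whenever class sizes are unequal, as they are for the pristine data set.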

3.3.2  Test  Methodology.  As  summarized  in  Figure  10,  three  cases  are  tested: 

Pristine This case simulates a situation where there are no mis-aligned signatures (that is, all
signatures fall within the ±18 window after the circular centroid). There are no coarse
alignment problems and no problem signatures. In a sense, this removes alignment
issues from the classification problem.

Hand-Aligned This case adds the bad signatures to the problem. Usually, bad signatures are
observed to have noise flashes which cause the coarse alignment to fail (recall Figure 9).
These bad signatures do not match the overall energy characteristics of the rest of their
class, but are forced into the ±18 window by human alignment. Thus, coarse alignment
is performed by human perception and the CC within the AGC is turned off. In this
case, there are no coarse alignment problems, but there are bad signatures. This case
tests the ability of the classifier to handle outliers within classes.

Arbitrary This case allows the AGC system (circular centroid) to perform the coarse alignment
on all available signatures. Both coarse alignment and bad signatures affect $P_{cc}$.
This case tests the performance under the harshest of the alignment conditions.

Pristine vs. Arbitrary This case trains on pristine data only, but includes bad signatures in
the test set. This case tests whether arbitrary data corrupts the template being generated
to some extent.

In all of the cases, the fine alignment process remains the same.

Figure 10. AGC Testing Methodology (See Section 3.3.2)

3.3.3  Implementation.  Trials  are  made  using  the  hold-out  method.  Half  the  data 
set  is  used  in  training  and  half  the  data  set  is  used  in  testing.  Matlab  code  randomly  splits  the 
data  into  the  training  and  testing  sets  for  independent  trials.  These  trials  are  independent  in 
the  sense  that,  for  each  trial,  training  and  test  sets  are  mutually  exclusive.  Trial  to  trial,  there 
will  be  common  vectors  in  the  testing  group.  This  fact  does  bias  the  results,  because  the  same 
test  vectors  may  be  repetitively  misclassified.  For  this  thesis,  the  desire  was  to  produce  some 
confidence  intervals  on  the  results  and  the  data  layout  was  well  suited  to  the  randomization 
procedure  devised.  Also,  the  statistics  of  the  classes  seem  to  be  well  represented  by  the  data 
available. The use of many Monte Carlo repetitions is designed to bound the final error rates
and reduce the impact of presentation order on the training portion of the AGC. Leave-one-out
results are given in Appendix B.
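
The randomized hold-out procedure described above can be sketched as follows. This is a minimal illustration in Python rather than the Matlab code used for the thesis; the function names, the seed, and the sample counts are assumptions made for the example. Within each trial the two halves are disjoint, while different trials may share test vectors, as noted above.

```python
import random

def holdout_split(n_samples, rng):
    """Randomly split indices 0..n_samples-1 into two disjoint halves."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    half = n_samples // 2
    return indices[:half], indices[half:]

def monte_carlo_splits(n_samples, n_trials, seed=0):
    """Yield one (train, test) index pair per Monte Carlo trial."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    for _ in range(n_trials):
        yield holdout_split(n_samples, rng)

# Hypothetical sizes: 10 signatures, 3 trials.
splits = list(monte_carlo_splits(n_samples=10, n_trials=3))
for train, test in splits:
    print(sorted(train), sorted(test))
```

Each trial would then pass its training and test indices to the classifier, and the per-trial results would be accumulated.
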

The randomly generated training and test sets are given to the AGC for computation,
and  Matlab  code  is  used  to  interpret  the  output  for  classification.  The  AGC  produces  a  matrix 
of  distances  where  each  row  corresponds  to  a  test  vector  and  each  column  is  its  distance,  or 
test  statistic,  with  respect  to  a  given  class.  Matlab  selects  the  minimum  value  from  each  row 
as  the  “winning”  class  assignment.  In  all  cases,  a  priori  probabilities  are  assumed  to  be  equal. 
Trials continue until statistical significance between the overall Pcc’s for the pristine and arbitrary
cases is established. This takes on the order of 1750 trials per case. The hand-aligned trials
were repeated on the order of 750 times. Results from the trials are averaged together to produce
final  confusion  matrices.  All  runs  are  completed  on  Sun  Workstations. 
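
The decision rule and the averaging of confusion matrices can be illustrated with a short sketch. It is written in Python rather than Matlab, and the distance values, labels, and second-trial matrix are made up for the example. Averaging over trials is also why the confusion matrices in Chapter IV contain fractional counts.

```python
def assign_classes(distances):
    """For each test vector (row), pick the class (column) with minimum distance."""
    return [row.index(min(row)) for row in distances]

def confusion_matrix(actual, assigned, n_classes):
    """Count test vectors by (actual class, assigned class)."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for a, b in zip(actual, assigned):
        counts[a][b] += 1
    return counts

def average_matrices(matrices):
    """Element-wise mean of equally sized matrices, one per trial."""
    n = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[i][j] for m in matrices) / n for j in range(cols)]
            for i in range(rows)]

# Hypothetical distance matrix: 3 test vectors, 2 classes.
distances = [[0.2, 1.5], [2.0, 0.7], [0.9, 0.4]]
actual = [0, 1, 0]
assigned = assign_classes(distances)
trial_1 = confusion_matrix(actual, assigned, n_classes=2)
trial_2 = [[2, 0], [0, 1]]  # made-up result of a second trial
averaged = average_matrices([trial_1, trial_2])
print(assigned, trial_1, averaged)
```
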

3.4  Impact  of  Classifier  Type 

LNKnet  implemented  the  non-parametric  classifier  comparisons.  Trials  were  executed 
with respect to a k-NN and a neural network. Because alignment was not being tested, the
classifier  was  given  signatures  finely  aligned  with  the  actual  class  only.  This  biases  the  result 
in  favor  of  correct  classification,  but  still  gives  a  relative  measure  of  classification  capability 
for  this  high  dimensional  data  set.  For  comparison,  runs  were  also  made  against  a  Gaussian 
classifier  (implemented  on  LNKnet). 

3.4.1  Data  Preparation.  Extracting  features  from  the  AGC  was  accomplished  by 
going  into  the  C  code  and  writing  signatures  directly  to  a  binary  file  just  before  inclusion  in  the 
class  template  during  training.  Thus,  all  alignment  (both  coarse  and  fine)  and  preprocessing 
steps  are  accomplished  before  extracting  the  signature  from  the  AGC.  Matlab  was  used  to 
read  the  file  and  convert  it  to  ASCII,  the  required  format  for  LNKnet. 

3.4.2  Test  Criteria.  For  these  tests,  the  same  test  criteria  as  Section  3.3  apply. 
Confusion  matrices  and  97.5%  confidence  bounds  are  generated. 
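
This section does not state how the confidence bounds are computed. One common choice, shown here purely as an assumption, is a two-sided normal-approximation interval on the mean of the per-trial classification rates. The trial values below are hypothetical and are not taken from the thesis.

```python
from statistics import NormalDist, mean, stdev

def confidence_half_width(rates, level=0.975):
    """Half-width of a two-sided normal-approximation interval on the mean."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 2.24 for 97.5%
    return z * stdev(rates) / len(rates) ** 0.5

# Hypothetical per-trial Pcc values in percent.
trial_rates = [90.1, 89.4, 91.0, 90.6, 89.9]
center = mean(trial_rates)
half_width = confidence_half_width(trial_rates)
print(f"{center:.2f} +/- {half_width:.2f}")
```

Under this assumption the half-width shrinks with the square root of the number of trials, which is consistent with the need for many repetitions before two cases separate statistically.
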

3.4.3  Test  Methodology.  Test  methodology  is  the  same  as  above,  but  includes  only 
the  arbitrary  and  pristine  cases. 

3.4.4 Implementation. Runs are repeated in the same way as Section 3.3, with the
exception  that  N-fold  cross  validation  is  utilized.  This  test  procedure  divides  all  data  into  equal 
subsets  and  then  iteratively  tests  on  one  subset  at  a  time,  while  training  on  the  rest.  The  concept 
is  the  same  as  with  leave  one  out  evaluation,  but  with  larger  test  sets.  In  this  thesis,  50-fold 
cross  validation  is  used  to  form  confusion  matrices  as  a  measure  of  classification  capability  as 
described  earlier.  The  number  of  trials  available  for  execution  was  limited  by  LNKnet.  This 
is  because  LNKnet  produces  massive  parameter  files  which  are  directly  proportional  to  the 
size  and  dimensionality  of  the  data. 
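
The N-fold procedure can be sketched as below. This is an illustrative Python fragment, not LNKnet's implementation; the interleaved assignment of samples to folds and the sample count are assumptions chosen for brevity.

```python
def n_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists, holding out one fold at a time."""
    # Interleaved assignment keeps fold sizes within one sample of each other.
    folds = [list(range(f, n_samples, n_folds)) for f in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test

# Hypothetical data set of 100 signatures with 50-fold cross validation.
splits = list(n_fold_indices(n_samples=100, n_folds=50))
print(len(splits), splits[0][1])
```

Every sample is tested exactly once, so N folds give N repetitions, which is why only fifty repetitions are available for these runs.
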

For the k-NN runs, 15 nearest neighbors are used. This number was selected because
in  the  Bayes  experiments  below,  15  was  generally  found  to  be  a  value  that  produced  the 
best  results.  For  the  Gaussian  runs,  LNKnet  was  set  up  as  the  AGC,  with  class  dependent, 
uncorrelated  covariance  matrices. 
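
A Gaussian classifier with class dependent, uncorrelated covariance matrices reduces to a per-feature mean and variance for each class. The sketch below is a generic textbook form with equal priors, written in Python. It is not the AGC or LNKnet code, and the training data are hypothetical.

```python
import math

def fit_diagonal_gaussian(samples):
    """Per-feature mean and variance for one class (a list of feature lists)."""
    n = len(samples)
    dims = len(samples[0])
    means = [sum(s[d] for s in samples) / n for d in range(dims)]
    variances = [sum((s[d] - means[d]) ** 2 for s in samples) / (n - 1)
                 for d in range(dims)]
    return means, variances

def distance(x, means, variances):
    """Twice the negative log-likelihood, up to an additive constant."""
    return sum((xi - m) ** 2 / v + math.log(v)
               for xi, m, v in zip(x, means, variances))

def classify(x, models):
    """Equal priors: choose the class with the smallest distance."""
    scores = [distance(x, m, v) for m, v in models]
    return scores.index(min(scores))

# Hypothetical two-feature training data for two classes.
class_0 = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [0.1, 0.2]]
class_1 = [[2.0, 2.1], [2.2, 1.9], [1.9, 2.0], [2.1, 2.2]]
models = [fit_diagonal_gaussian(class_0), fit_diagonal_gaussian(class_1)]
label_a = classify([0.05, 0.05], models)
label_b = classify([2.05, 2.0], models)
print(label_a, label_b)
```
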

The neural network runs used 192 input nodes, 25 hidden nodes in one layer, and 4
output  nodes.  Most  of  the  standard  settings  within  LNKnet  were  retained  for  the  neural 
network  runs.  This  included  a  step  size  of  .1,  a  momentum  of  .6,  a  standard  sigmoid  function 
in  the  nodes,  and  the  square-error  cost  function.  Random  presentation  order  was  turned  “on” 
and  50  epochs  were  used  during  training. 
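
The roles of the step size and momentum settings can be illustrated with the common textbook form of gradient descent with momentum. LNKnet's exact update rule is not given here, so the form below is an assumption; only the values 0.1 and 0.6 come from the text, and the weight and gradient are made up.

```python
def momentum_step(weights, gradients, velocity, step_size=0.1, momentum=0.6):
    """One update: the new velocity blends the old velocity with the gradient."""
    new_velocity = [momentum * v - step_size * g
                    for v, g in zip(velocity, gradients)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

# Hypothetical single weight with a constant gradient of 2.0.
weights, velocity = [1.0], [0.0]
for _ in range(2):
    weights, velocity = momentum_step(weights, [2.0], velocity)
print(weights, velocity)
```
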

For each classifier type, equal a priori class probabilities are assumed.

3.5  Impact  of  Data 

This  experiment  is  conducted  according  to  Martin  (33)  and  bounds  the  best  possible 
classification  rates  one  can  hope  for  with  the  data  at  hand.  In  other  words,  it  is  a  measure 
of  the  separability  of  the  data.  Martin’s  algorithm  is  designed  only  for  two  class  problems. 
AGC  runs  were  accomplished  for  all  two  class  combinations  in  the  same  way  as  Section  3.3 
for  comparison. 

3.5.1  Data  Preparation.  Data  was  prepared  as  in  the  previous  section.  Like 
LNKnet,  Martin’s  code  expects  ASCII  formatted  data. 

3.5.2 Test Criteria. Martin’s code produces graphs showing the error rates generated
using  the  L  and  R  methods  described  in  Chapter  2.  Classification  results  are  expected  to  be 
relatively  high  if  the  data  is  separable  on  a  class  to  class  basis. 

3.5.3  Test  Methodology.  Methodology  followed  the  procedure  found  in  Martin’s 
thesis (32), using only k-NN evaluation because it provides the most consistent estimates of
the Bayes error. For the k-NN, the parameter k was varied to give a range of error rates as a
function of k. Results were generated by taking the best case from the range of k allowed by
Martin’s code. As mentioned earlier, k = 15 was usually the best case.

3.5.4 Implementation. Martin’s code performs 10-fold cross validation during
testing and produces error rates with respect to varying the k parameter. Option 2 was used
to  determine  the  decision  threshold  in  the  L  case.  The  results  given  in  the  next  chapter  were 
generated  by  taking  the  midpoint  between  the  R  and  L  case  as  the  estimate  of  the  Bayes  error. 
For  comparison,  class  to  class  runs  with  the  AGC  itself  were  made  using  the  pristine  and 
arbitrary  cases. 
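
The midpoint estimate can be written out in a few lines. The sketch below is not Martin's code. It assumes that the "best case" is the value of k with the smallest midpoint between the L and R error rates, which the text does not spell out, and the error rates are hypothetical.

```python
def bayes_error_estimate(l_errors, r_errors):
    """Return (k, midpoint) for the k whose L/R midpoint is smallest."""
    midpoints = {k: (l_errors[k] + r_errors[k]) / 2 for k in l_errors}
    best_k = min(midpoints, key=midpoints.get)
    return best_k, midpoints[best_k]

# Hypothetical error rates in percent, indexed by k.
l_errors = {5: 1.5, 10: 1.2, 15: 0.8, 20: 1.1}
r_errors = {5: 0.5, 10: 0.4, 15: 0.2, 20: 0.3}
best_k, estimate = bayes_error_estimate(l_errors, r_errors)
pcc_bound = 100.0 - estimate  # corresponding classification rate in percent
print(best_k, estimate, pcc_bound)
```
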

3.6  Summary  of  AGC  Methodology 

The  first  part  of  this  chapter  describes  the  methodology  used  to  evaluate  the  performance 
of  the  AGC.  The  overall  thrust  of  the  first  section  is  to  analyze  preprocessing  effectiveness. 
The  idea  is  to  begin  with  pristine  data  and  add  in  one  of  two  potential  problems  at  a  time. 
The  first  step  includes  problem  data  (bad  signatures)  to  see  whether  it  is  something  in  the 
data  which  causes  the  fine  alignment  and/or  subsequent  classification  to  fail.  The  next  step 
adds  back  coarse  alignment  as  an  issue.  With  the  arbitrary  case,  the  machine  is  allowed  to 
do  the  coarse  alignment.  This  experiment  tests  the  effect  the  CC  is  having  on  the  AGC.  The 
overall  thrust  of  the  second  section  of  this  chapter  is  to  analyze  the  modality  of  the  data:  Is 
classification  hurt  by  limitations  of  a  Gaussian  quadratic  discriminant?  Alignment  is  removed 
as  an  issue.  The  third  section  addresses  the  separability  of  the  data  on  a  class  to  class  basis. 
Results  are  presented  in  Chapter  IV. 

3.7 Discriminant Analysis of the UHRR Feature Space

This  portion  of  the  thesis  explores  the  use  of  Lee’s  algorithm  for  feature  discrimination 
with  the  UHRR  problem.  After  validating  the  performance  of  the  algorithm  with  simple 
example  problems  given  by  Lee,  the  algorithm  was  applied  to  a  number  of  cases  of  the  UHRR 
problem.  Two  of  the  validation  examples  appear  in  Appendix  A. 

All  preprocessing  was  accomplished  with  Matlab  code  emulating  the  AGC  algorithm, 
including  alignment,  template  construction,  and  the  generation  of  test  statistics.  Data  was 
converted  to  Matlab  formatted  binary  files  and  then  classified  with  Matlab  code.  The  Matlab 
code  was  verified  by  testing  its  classification  accuracy  against  the  AGC,  and  was  found  to 
perform  the  same  for  given  data  sets. 

The  results  presented  in  Chapter  4  use  256  range  bins  because  of  the  eventual  extension 
to  wavelet  analysis,  but  similar  results  were  observed  for  192  range  bins.  The  algorithm 
was  run  in  the  four  class  case  for  pristine  data.  Several  two  class  situations  were  also  tested 
involving  selected  targets. 

The  test  sequence  proceeds  as  follows: 

1.  During  training,  save  template  files  for  use  as  estimates  of  class  means  and  variances. 

2.  Classify  the  test  data. 

3.  Separate correctly classified signatures into a new data set for use with the Lee algorithm.

4.  Provide  the  Lee  algorithm  with  results  from  Steps  1  and  3. 

5.  Generate  and  save  the  EDBFM,  its  eigenvectors,  and  its  eigenvalues. 

6.  Reclassify  the  data  using  a  specified  number  of  features. 

Lee’s  article  clearly  implies  that  reclassification  occurs  in  the  new,  transformed  “EDBFM 
space”  (29:397).  Thus,  one  selects  the  number  of  features  used  in  the  new  space  and  forms 
a  transformation  matrix  composed  of  the  eigenvectors  corresponding  to  the  eigenvalues  of 
greatest  magnitude.  This  selection  process  is  the  same  as  the  ordering  that  takes  place  in 
a  KLT.  For  this  thesis,  signature  vectors  are  row  vectors  and  the  transformation  matrix’s 
columns are filled with eigenvectors. To generate sample vectors in the reduced feature space,
one post-multiplies a matrix of UHRR signatures by the transformation matrix. This post-multiplication
conveniently executes the inner products required to perform the projections
onto  the  new  basis  set. 
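
The construction of the transformation matrix and the post-multiplication can be sketched as follows. The fragment is in Python rather than Matlab and assumes the eigenvalues and eigenvectors of the EDBFM are already available; the three-dimensional values are hypothetical and axis-aligned so that the result is easy to check by hand.

```python
def transformation_matrix(eigenvalues, eigenvectors, n_features):
    """Columns are the eigenvectors with the largest-magnitude eigenvalues."""
    order = sorted(range(len(eigenvalues)),
                   key=lambda i: abs(eigenvalues[i]), reverse=True)
    chosen = [eigenvectors[i] for i in order[:n_features]]
    dims = len(chosen[0])
    # Transpose so that each chosen eigenvector becomes one column.
    return [[vec[d] for vec in chosen] for d in range(dims)]

def project(signatures, matrix):
    """Post-multiply row-vector signatures by the transformation matrix."""
    n_cols = len(matrix[0])
    return [[sum(row[d] * matrix[d][c] for d in range(len(row)))
             for c in range(n_cols)]
            for row in signatures]

# Hypothetical eigen-decomposition: eigenvectors[i] pairs with eigenvalues[i].
eigenvalues = [0.1, 5.0, 2.0]
eigenvectors = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
T = transformation_matrix(eigenvalues, eigenvectors, n_features=2)
reduced = project([[3.0, 4.0, 5.0], [1.0, 2.0, 9.0]], T)
print(T, reduced)
```

Each entry of the reduced matrix is the inner product of one signature with one retained eigenvector.
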

Transformed  training  and  test  sets  are  used  during  “re-”classification.  Both  correctly 
classified sample points and incorrectly classified points are used during final testing. Dominant
eigenvectors are added to the transformation matrix over several runs to get a feel for
how  many  features  are  required  to  maintain  the  original  classification  rate. 

At this point, an important caveat to the Lee algorithm becomes evident. When projecting
into the EDBFM space, one is using linear combinations of all the original features.
Thus every original feature must still be measured, and there seems to be no advantage in terms of reducing the overall dimensionality of the problem.
One  may  infer  from  Lee’s  article  that  the  dominant  eigenvectors  correspond  to  the  dominant 
features  (29:396).  Also,  one  may  infer  that  one  is  reducing  the  original  feature  space,  but  the 
technique  is  actually  performing  a  projection  into  a  space  with  fewer  features  (dimensions) 
(29:389,391). 

An attempt was made to interpret the EDBFM’s dominant eigenvectors in order to infer
where the discriminantly relevant information lies in the original space. Referring to the 3-dimensional
problem in Appendix A, one sees that the primary eigenvector’s most significant
components  correspond  to  the  discriminantly  relevant  features  in  the  original  space  (i.e.  the 
first  two  dimensions).  This  interpretation  has  an  intuitive  appeal  because  in  performing  an 
inner  product,  those  components  that  are  “most  important”  to  the  transformation  will  be  those 
components  that  are  weighted  heaviest  in  the  inner  product.  For  example,  in  the  case  of 
projecting  a  two  dimensional  vector  onto  the  x-axis,  the  y-component  of  the  vector  plays  no 
role  in  the  projection.  In  more  complicated  situations,  where  there  are  non-zero  elements  of 
the  projection  operator  being  ignored,  one  is  throwing  away  some  information  about  how  the 
transformation  occurs. 

The most dominant eigenvectors and the least significant eigenvectors tend to “point
to” very different feature elements. Again, these eigenvectors are ordered with respect to the
magnitude of the corresponding eigenvalue. Figure 11 shows the disparity.


[Figure 11 panels: Least Significant Eigenvector; Most Significant Eigenvector]


Figure  11.  Comparing  Eigenvectors:  4  Class,  Pristine  Data 


The  key  point  in  this  analysis,  however,  is  that  the  dominant  eigenvectors  may  provide 
information  about  the  discrimination  power  of  features  in  the  original  space.  In  Figure  11, 
features  80-90,  105-115,  and  150-160  would  be  interpreted  as  important  to  discrimination. 
This  idea  is  tested  by  using  dominant  eigenvectors  to  choose  features  in  the  original  space 
and then re-classifying. This technique was successfully applied in Kocur’s work on breast
cancer detection (25). As above, the most significant eigenvectors direct one’s attention to
specific features. Her choices for relevant features, based on analysis of the EDBFM, agree
with Steppe’s analysis of feature saliency with respect to neural networks (43).

Features are selected by averaging the ten most significant eigenvectors and choosing the
elements with significant magnitudes as corresponding to relevant features. Averaging ten
eigenvectors gave the best classification rate among the options tested; the alternatives were
using the most significant eigenvector only or averaging the five most significant eigenvectors.
Various numbers of features are used and the results tabulated.
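
The selection rule can be sketched in Python as follows. The fragment assumes the eigenvectors are already ordered from most to least significant, averages them element by element, and ranks the original features by the magnitude of the averaged elements. Whether eigenvector signs were aligned before averaging is not stated in the text, so the plain average used here is an assumption, and the numbers are hypothetical.

```python
def select_features(eigenvectors, n_average, n_select):
    """Indices of the n_select largest-magnitude elements of the mean vector."""
    top = eigenvectors[:n_average]
    dims = len(top[0])
    mean_vec = [sum(vec[d] for vec in top) / len(top) for d in range(dims)]
    ranked = sorted(range(dims), key=lambda d: abs(mean_vec[d]), reverse=True)
    return sorted(ranked[:n_select])

def keep_columns(signatures, columns):
    """Restrict each signature to the selected original features."""
    return [[row[c] for c in columns] for row in signatures]

# Hypothetical ordered eigenvectors over five original features.
ordered = [[0.9, 0.1, 0.0, 0.4, 0.0],
           [0.7, 0.0, 0.1, 0.6, 0.1]]
columns = select_features(ordered, n_average=2, n_select=2)
reduced = keep_columns([[10, 11, 12, 13, 14]], columns)
print(columns, reduced)
```

Unlike the projection into the EDBFM space, the reduced signatures here contain only original range bins, so the discarded bins never need to be measured.
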

3.8 Summary

After  describing  methodology  for  the  underlying  pattern  recognition  problem,  this 
portion  of  the  thesis  outlines  a  feature  discrimination  technique  for  the  UHRR  radar  problem. 
The  advantage  of  Lee’s  approach  is  to  base  feature  saliency  on  classification  error  instead  of 
reconstruction  error.  Lee  finds  a  rotated  space  in  which  a  reduced  number  of  features  are 
required  to  attain  about  the  same  classification  rate  as  the  original  space.  The  point  addressed 
in  this  thesis  is  that  one  may  attain  true  feature  reduction  by  interpreting  the  eigenvectors  of 
the  EDBFM.  Results  comparing  the  performance  of  the  feature  discrimination  approaches  are 
given  in  Chapter  4. 

IV. Results


4.1  Introduction 

This chapter presents and summarizes results of the experiments described in the previous chapter. Results are
presented  in  the  form  of  confusion  matrices  for  the  full,  four-class  comparisons.  97.5% 
confidence  intervals  are  given  for  all  estimates  of  the  classification  rates.  For  two-class 
comparisons,  results  are  summarized  in  aggregated  tables. 


4.2  Impact  of  Alignment 

Table  2  shows  overall  classification  performance  and  the  next  three  tables  show  the 
interclass  details.  Several  observations  and  conclusions  may  be  made  from  each  of  the  tables. 
Overall,  Table  2  shows  there  is  a  statistically  significant  alignment  problem  in  the  four  class 
case. The average Pcc’s presented here were observed to be stable and more trials (on the
order  of  thousands)  would  show  statistical  separation  even  for  the  hand  aligned  case. 


Table 2. AGC: Overall Pcc

Case                      Pcc (%)    97.5% Confidence Interval (%)
Pristine                  90.16      ±1.31
Hand Aligned              88.66      ±2.76
Arbitrary                 87.14      ±1.59
Pristine vs. Arbitrary    82.60      ±3.32

The results presented in this section show that the “bad” data, represented by the hand-aligned
row, affects classification to some extent and the inability of the circular centroid to
coarsely align the bad data affects classification to an approximately equal extent. The two
factors combine to yield a three percent degradation in performance in the overall, 4-class
case. The impact of alignment and data problems, however, is class dependent, as shown by the
individual rows of the confusion matrices.
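
The per-class rate in each row of a confusion matrix is the diagonal entry divided by the row sum. As a check, the short Python sketch below applies this to the 9a and f2 rows of Table 3 and reproduces the listed Pcc values of 75.87% and 96.72% to within rounding. The function name and the dictionary layout are illustrative only.

```python
def row_pcc(row, actual_index):
    """Percent of an actual class's test vectors assigned to that class."""
    return 100.0 * row[actual_index] / sum(row)

# Averaged counts from Table 3, in assigned-class order 9a, aa, f2, fa.
table_3_rows = {
    "9a": (0, [374.03, 49.58, 1.55, 67.84]),
    "f2": (2, [0.038, 2.96, 285.32, 6.68]),
}
rates = {name: row_pcc(row, idx) for name, (idx, row) in table_3_rows.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.2f}")
```
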

Recalling Table 1, the AGC has the most difficulty with class f2 with respect to alignment
because it had the highest percentage of bad signatures. As shown in Tables 3 through 6,
the AGC’s ability to classify members of f2 degrades by about eleven percent from the pristine case
to the arbitrary case. Similarly, class fa degrades by about four percent. The performance
of the other two classes remains relatively constant across the three cases, with 9a always
performing poorly and aa always performing rather well.

This  result  implies  that  the  alignment  problem  can  be  extremely  data  dependent.  The 
fact  that  class  9a  has  poor  performance  across  all  three  cases  says  that  the  class  may  have  more 
than  one  cluster  center  in  the  feature  space.  Here,  the  unimodal  assumption  of  the  Gaussian 
classifier  hurts  performance.  The  fact  that  class  aa  performs  well  across  all  three  cases  is 
because  there  are  very  few  samples  characterized  as  bad  and  that  it  is  well  separated  from 
the other classes. Measures like the Fisher discriminant would probably work well with this
class, and perhaps with f2 and fa, since they also give very good results. However, the issue that will
always play a role in any such analysis is how one has decided to make alignment decisions.
Because classes f2 and fa have the most bad signatures, their classification rates degrade sharply, as
would be expected.

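In the case of training on pristine data and testing on arbitrary data, note that the AGC
appears capable of “learning” the misalignments induced by bad data. This is shown by
comparing Table 6 with the other tables: the f2 rate is far lower when bad signatures appear
only in the test set (Table 6) than when they are also present during training (Table 5).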


Table 3. AGC: Pristine Data

Actual    Assigned Class                          Pcc      97.5% Confidence
Class     9a        aa        f2        fa        (%)      Interval (%)
9a        374.03    49.58     1.55      67.84     75.87    ±1.89
aa        .572      499.11    0.00      3.32      99.03    ±0.43
f2        .038      2.96      285.32    6.68      96.72    ±0.77
fa        8.92      30.49     3.27      456.32    91.44    ±1.67

Table 4. AGC: Hand Aligned Data

Actual    Assigned Class                          Pcc      97.5% Confidence
Class     9a        aa        f2        fa        (%)      Interval (%)
9a        372.76    58.74     1.56      70.94     73.96    ±3.82
aa        2.25      495.8     0.01      5.94      98.72    ±0.01
f2        4.46      12.47     484.78    13.29     94.13    ±2.04
fa        14.97     43.46     3.56      442.01    87.70    ±2.86

Table 5. AGC: Arbitrary Data

Actual    Assigned Class                               Pcc          97.5% Confidence
Class     9a             aa        f2        fa        (%)          Interval (%)
9a        384.54         54.01     1.35      64.10     76.30        ±2.02
aa        [illegible]    490.20    0.00      6.52      97.26        ±0.78
f2        10.70          27.82     440.14    36.34     85.46        ±1.68
fa        8.62           41.39     2.55      451.43    [missing]    ±1.45

Table 6. AGC: Train Pristine/Test Arbitrary

Actual    Assigned Class                          Pcc      97.5% Confidence
Class     9a        aa        f2        fa        (%)      Interval (%)
9a        370.76    74.68     1.68      69.87     71.71    ±3.95
aa        1.75      498.74    0.40      5.51      98.57    ±1.04
f2        43.63     14.41     541.61    137.36    73.49    ±3.87
fa        8.73      34.04     3.41      464.83    90.96    ±2.51

4.3 Impact of Classifier Type


The results in this section show that some advantage may be gained by using non-parametric
techniques, although the improvement is not particularly impressive. Results are
given in Table 7 for both the arbitrary and pristine cases. Note that once the alignment problem
is “solved,” results for the neural network technique converge, while the Gaussian classifier
and k-NN still show the influence of the bad data points. Because just fifty repetitions are
accomplished through the cross-validation process, the 97.5% confidence interval is on the
order of ±5% and the results shown do not have statistical separation. It seems that the Gaussian
classifier is adequately modelling the data, although there is a difference in values.


Table 7. Comparison of Classifiers

Classifier        Pristine Case (Pcc (%))    Arbitrary Case (Pcc (%))
k-NN              96.48                      94.85
Neural Network    96.57                      96.55
Gaussian          95.75                      92.74

4.4  Impact  of  Data 

In  this  section,  results  from  the  Bayes  bounding  experiments  (Table  8)  and  2-class  AGC 
runs  (Table  9)  are  presented  and  compared.  Dashes  indicate  that  the  k-NN  “learned”  the  data. 
In  other  words,  it  perfectly  reclassified  the  data  for  all  cross  validation  runs  and  all  values  of 
k.  The  confidence  intervals  on  Bayes  estimates  are  given  by  Martin  to  be  2.26  %  for  a  97.5  % 
level  of  confidence. 


Table 8. Bayes Bound Estimates (Option 2): Class by Class

Classes    Pristine Case    Hand Aligned Case    Arbitrary Case
           Pcc (%)          Pcc (%)              Pcc (%)
9a - aa    99.55            99.10                98.50
9a - f2    -                99.72                99.65
9a - fa    95.25            94.20                94.50
aa - f2    -                99.80                99.99
aa - fa    -                99.05                99.10
f2 - fa    99.94            -                    -

Table 9. AGC: Class by Class

           Pristine Case         Hand Aligned Case     Arbitrary Case
Classes    Pcc(%)    Conf(%)     Pcc(%)    Conf(%)     Pcc(%)    Conf(%)
9a - aa    94.01     ±2.06       93.23     ±2.11       92.93     ±2.25
9a - f2    98.32     ±1.13       97.42     ±1.38       94.31     [illegible]
9a - fa    90.85     ±2.22       88.81     ±2.76       91.30     ±2.47
aa - f2    97.96     ±1.16       98.12     ±1.18       95.56     [illegible]
aa - fa    96.44     ±1.62       94.74     ±1.96       94.78     [illegible]
f2 - fa    97.98     ±1.23       97.09     ±1.46       94.10     [illegible]

Several observations may be made from these results. First, the data in a class to class
sense appear to be nicely separable. Only the error rates of the 9a - fa combination suggest
overlap in the two class feature space. Also note that f2 seems to show adequate separation in
each two class situation. The high performance seen here must be viewed in light of the fact that the
alignment problem is “solved,” as described in the last chapter.

One may compare Table 9 with the tables from Section 4.2 to see how performance
jumps in this simpler, two-class problem. Even this result is affected by the alignment
issue, because alignment plays its normal role in the results of Table 9. In the two class
situations, however, there is only one other class to be improperly aligned against. Note how class f2 is not
confused with other classes to the same extent as in the four class case, although it is still the
second worst performer. Class fa still performs relatively poorly. Also note that alignment is
intertwined with this specific case because it is the class that the circular centroid
seems to struggle with the most.


4.5  Feature  Discrimination 

Feature discrimination using the EDBFM was performed in the transformed, “EDBFM”
space,  as  well  as  the  original  space. 

4.5.1  Discrimination  in  the  Transformed  Space.  The  results  were  generated  by 
utilizing  subsets  of  eigenvectors  sorted  by  magnitude  of  eigenvalue.  These  subsets  form  the 
columns of the transformation matrix and are applied to the data as described in the last
chapter. 

Figure  12  shows  results  for  a  two  class  problem  and  four  class  problem.  This  diagram 
shows  excellent  classification  rates  using  just  five  to  ten  percent  of  the  available  features  in  the 
transformed  space.  The  peaking  effect  seen  in  the  four  class  case  is  troubling  because  all  that 
has been done is a linear transformation. An explanation for the roll off may be that the
transformation is inadequately represented because of the assumption of uncorrelated features
to  begin  with.  In  other  words,  the  data  is  inadequately  represented  by  that  assumption  and 
more  data  is  required  to  properly  calculate  the  full  covariance  matrix.  In  the  end,  it  appears  that 
the  new  feature  space  is  vulnerable  to  the  curse  of  dimensionality  issue  discussed  in  Chapter 
2. 


[Figure 12 panels: 2 Class Rates, Transformed Data; 4 Class Rates, Transformed Data]


Figure  12.  Reclassification  Curves,  Transformed  Data 


4.5.2 Discrimination in the Original Space. This section demonstrates the capability
to reclassify in the original space by interpreting the most significant eigenvectors of
the  EDBFM.  The  average  of  the  ten  most  significant  eigenvectors  is  shown  in  Figure  13.  As 
described  in  Chapter  3,  the  elements  of  the  eigenvector  selected  first  are  those  with  the  highest 
magnitude. These elements indicate the features to focus on during reclassification. In the two
class  problem,  it  appears  that  different  areas  along  the  signature  provide  important  information 
for  discrimination.  One  would  have  to  know  the  specifics  of  the  aircraft  under  test  and  the  test 
environment  to  make  in-depth  inferences.  In  both  cases  the  eigenvectors  represent  an  average 
representation  of  the  significant  features  with  respect  to  each  two  class  situation.  Thus,  it  is 
hard  to  interpret  specific  areas  of  interest.  However,  it  seems  clear  that  in  the  four  class  case, 
leading  edge  information  tends  to  dominate  the  decision.  In  the  two  class  case,  there  are  many 
different  areas  that  provide  discriminantly  relevant  information.  Note  that  the  eigenvectors 
emphasize  very  different  sets  of  features  in  each  problem.  This  is  not  surprising  because  the 
calculation  of  the  EDBFM  is  very  data  dependent. 


[Figure 13 panels: Average of Eigenvectors: 2 Class; Average of Eigenvectors: 4 Class]


Figure  13.  Average  of  Ten  Most  Significant  Eigenvectors 


Figure  14  shows  the  resulting  reclassification  rates.  Once  again,  results  are  quite  good. 
In fact, when using less than half the features, reclassifying in the original space outperforms
reclassifying  in  the  transformed  space.  Reclassifying  in  the  transformed  space  tends  to  peak 
at  a  higher  level  than  reclassifying  in  the  original  space,  however.  Once  again,  in  the  four 
class  environment  there  is  a  peaking  and  roll  off  of  performance  as  more  and  more  features  are 
added. In this case, features are treated as uncorrelated (via the main-diagonal-only covariance
matrix),  so  the  roll  off  is  not  as  pronounced.  Recall  from  Chapter  2  that  this  effect  is  predicted 
by  Chandrasekaran  (5)  and  has  to  do  with  the  curse  of  dimensionality. 


[Figure 14 panels: 2 Class Rates, Original Data; 4 Class Rates, Original Data]


Figure  14.  Reclassification  Curve,  4-Class,  Pristine  Data:  Original  Space 


4.6  Summary 

This chapter presents the results that accomplish most of the goals of this thesis: to
form an estimate of the AGC’s performance with respect to alignment and to perform feature
saliency analyses. Results show that alignment is the critical issue in the UHRR radar problem.
Classification rates drop over 11% for the f2 class. In the overall classification rate, though,
degradation is on the order of 3%.

Comparison  of  parametric  and  non-parametric  classifiers  shows  that  there  is  some,  but 
not  a  major,  price  to  be  paid  with  the  assumption  of  Gaussian  distributions.  Classification 
rates hover around 95% and the differences are not statistically significant at a 97.5% level of confidence.

Despite  data  that  is  well  separated  (as  indicated  by  classification  rates  above  99%  during 
the  Bayes  bounding  experiments),  the  ability  to  properly  declare  a  starting  point  for  the  target 
within a radar signature is critical to the analysis. The approach used by the AGC intertwines
the  alignment  judgment  with  the  final  classification  judgment  when  the  issues  may  be  more 
properly  handled  separately.  The  starting  point  of  a  target  in  a  noisy  signal  should  be  unrelated 
to  its  actual  class.  Thus,  the  decision  of  where  the  signal  begins  and  what  the  signal  is  should 
also  be  distinct. 

Furthermore,  this  chapter  implements  and  applies  Lee’s  analysis  of  the  EDBFM  to  the 
UHRR  radar  problem.  His  analysis  is  extended  in  this  thesis  by  interpreting  the  eigenvectors 
of  the  EDBFM  to  identify  relevant  features  in  the  original  feature  space.  True  feature  reduction 
in  the  original  feature  space  is  attained  with  results  meeting  or  exceeding  performance  with 
the  full  set  of  features.  This  can  be  attained  using  as  little  as  5%  of  the  original  features  (see 
Figure  14). 

In  this  problem,  classification  in  the  original  space  by  interpreting  the  eigenvectors 
of  the  EDBFM  outperforms  Lee’s  original  method  for  small  feature  sets.  As  more  features 
are  added,  the  Lee  transformation  method  peaks  above  the  method  presented  in  this  thesis. 
However,  because  features  become  correlated,  performance  is  hurt  as  the  number  of  features 
rises past 225. Both methods achieve peak performance that meets or exceeds classification in
the  original  space  with  a  full  set  of  features.  Classification  of  data  in  the  original  space  shows 
excellent  results  here  and  has  been  used  successfully  with  other  UHRR  class  combinations  and 
mammography  data  (25).  Chapter  5  extends  the  use  of  Lee’s  algorithm  to  selecting  appropriate 
wavelet  scales  with  respect  to  a  two  class  UHRR  radar  problem. 

V. Utilizing Feature Discrimination and Wavelet Transformations

5.1  Introduction 

The  purpose  of  this  chapter  is  to  use  a  multiresolution  analysis  (MRA)  of  UHRR  data 
and  demonstrate  that  relevant  scales  may  be  determined  from  the  EDBFM.  Choosing  wavelet 
bands  based  on  discrimination  power  is  shown  to  have  an  advantage  in  classification  over 
choosing  wavelet  bands  with  respect  to  reconstruction  error  for  this  data. 

5.2  Implementation 

5.2.1 Data. The data demonstrating the results are identical to the 2 class problem
in Chapter 4, Section 4.5: pristine data, classes f2 and fa. Raw, 256-element feature vectors
are used because the wavelet code works best with signals whose lengths are powers of two.

5.2.2  Alignment  Issues.  As  mentioned  in  Chapter  2,  implementing  a  wavelet- 
based  extraction  and  classification  scheme  requires  special  attention  to  properly  aligning  data 
before decomposing them. Classification of UHRR signatures failed when the following
process excluded the alignment steps. In this case, the circular centroid is applied only to the
original signal to solve the wrap-around problem. Coarse alignment is irrelevant because full
correlations are used to align the wavelet features. The overall scheme is shown in Figure 15.
To accomplish this, wavelet extraction is performed in a two-step process.

First,  Suzuki’s  algorithm  is  applied  to  form  shift-invariant  wavelet  representations. 
Daubechies  20-tap  wavelets  are  used.  The  problem  of  edge  effects  is  handled  by  extending 
the  signal  on  each  end  by  the  average  of  the  nearest  50  bins.  In  other  words,  instead  of  zero 
padding  or  periodizing,  an  average  of  the  first  50  range  bins  is  taken  and  concatenated  to  the 
“front”  of  the  signal.  Likewise,  an  average  of  the  last  50  bins  is  taken  and  concatenated  to  the 
“tail” of the signal. The averaging was found necessary because the signatures are not nearly
equal at their ends, and periodic extensions of the signal would introduce discontinuities into
the analysis. Also, signal extension, as opposed to periodization, is more compatible with
Suzuki’s code.

Figure 15. Wavelet Feature Extraction (flowchart boxes: Circular Centroid; Interleaved
Multiresolution Analysis; Full Correlation Alignment; Removal of Redundant Elements to
Produce “Wavelet Features”; Train and Classify; Produce Energy Map; Calculate EDBFM and
Eigenvectors; Prioritize Wavelet Bands; Train and Classify Using “Wavelet Features”)

Equal  numbers  of  points  are  added  to  the  beginning  and  end  of  the  signal  to  form  a 
signal  of  length  1024.  Suzuki’s  code  assumes  zero  padding  beyond  this  range.  Thus,  there 
will  be  more  pronounced  edge  effects  where  the  elongated  signal  goes  to  zero.  The  linear 
convolution  process  causes  edge  effects  to  creep  slowly  toward  the  center  of  the  elongated 
signal  with  successive  filtering.  But  these  edge  effects  are  removed  by  truncating  the  signal 
back  to  length  256. 
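
The extension and truncation steps described above can be illustrated with a short Python
sketch. This is not Suzuki’s code or the implementation used in this thesis; the function names
and the stand-in signature are assumptions made for illustration, and the wavelet filtering
that would occur between extension and truncation is omitted.

```python
def extend_with_edge_means(signal, target_len=1024, n_avg=50):
    """Pad both ends of `signal` with the mean of its nearest `n_avg` bins.

    Returns the extended signal and the number of points added to each end.
    """
    pad_total = target_len - len(signal)
    if pad_total < 0 or pad_total % 2:
        raise ValueError("target_len must exceed len(signal) by an even amount")
    pad = pad_total // 2
    front = sum(signal[:n_avg]) / n_avg    # average of the first n_avg range bins
    tail = sum(signal[-n_avg:]) / n_avg    # average of the last n_avg range bins
    return [front] * pad + list(signal) + [tail] * pad, pad


def truncate_to_original(filtered, pad, original_len=256):
    """Drop the padded margins, and with them the edge effects they absorbed."""
    return filtered[pad:pad + original_len]


# Stand-in 256-bin signature (hypothetical data, not a measured UHRR signature).
signature = [float(i % 7) for i in range(256)]
extended, pad = extend_with_edge_means(signature)
restored = truncate_to_original(extended, pad)
```

With a 256-bin signature and a target length of 1024, 384 points are added to each end, so
the original bins occupy the center of the extended signal.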

Each  vector  is  fully  decomposed  into  eight  scales.  These  resulting  vectors  are  all  256 
elements  long  and  contain  redundant  information  about  the  signal.  However,  they  also  include 
information  that  can  be  used  to  align  exemplars.  During  the  correlation  process,  each  level  of 
decomposition  is  circularly  correlated  with  the  corresponding  level  of  decomposition  in  the 
class  template.  The  best  match  is  stored  in  a  feature  matrix  which  ends  up  being  16  by  256 
elements  holding  8  approximation  levels  and  8  detail  levels. 
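
The correlation step for a single level of decomposition may be sketched in Python as
follows. The function names and the toy template are assumptions for illustration only;
the direct O(n^2) correlation is chosen for clarity and is not claimed to match the
thesis implementation.

```python
def circular_correlation(x, template):
    """Correlation of x against template for every circular shift of x."""
    n = len(x)
    return [sum(x[(i + s) % n] * template[i] for i in range(n)) for s in range(n)]


def align_to_template(x, template):
    """Circularly shift x to the lag with the largest correlation."""
    corr = circular_correlation(x, template)
    best = max(range(len(corr)), key=corr.__getitem__)
    aligned = [x[(i + best) % len(x)] for i in range(len(x))]
    return aligned, best


# Toy example: the exemplar is the template rotated by three bins.
template = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0]
exemplar = template[-3:] + template[:-3]
aligned, shift = align_to_template(exemplar, template)
```

In the scheme described above, this alignment would be repeated for each of the 16 rows
(8 approximation levels and 8 detail levels) against the matching row of the class template.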

5.2.3  Formation  of  Wavelet  Feature  Vectors.  Each  level  of  decomposition  in  this 
wavelet feature matrix is then downsampled to remove terms containing redundant information.
The first level of decomposition is downsampled by 2, the second level by 4, etc. This is
done  because  Suzuki’s  code,  essentially,  has  interleaved  the  wavelet  decompositions  together 
(46).  All  of  the  downsampled  detail  coefficients  and  the  eighth  level  approximation  coefficient 
are  retained.  The  results  from  this  downsampling  process  are  stored  in  a  final,  256  element 
feature  vector.  This  vector  contains  all  of  the  downsampled  detail  signals  and  the  single, 
lowest  level,  approximation  coefficient.  Figure  16  shows  the  organization  of  a  wavelet  feature 
vector. 
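
The downsampling arithmetic can be checked with a brief Python sketch. The band sizes
follow directly from the text (128 + 64 + 32 + 16 + 8 + 4 + 2 + 1 detail coefficients plus
one approximation coefficient give 256 elements). The ordering of bands within the vector
(D1 first, A8 last) and the choice of which sample phase to keep are assumptions here,
since Figure 16 is the authority on the actual layout.

```python
def build_feature_vector(details, approx, levels=8):
    """Assemble a wavelet feature vector from shift-aligned, undecimated rows.

    details[k - 1] is the full-length detail row at level k;
    approx is the full-length approximation row at the last level.
    Level k is downsampled by 2**k (assumed phase: start at index 0).
    """
    vector = []
    for k in range(1, levels + 1):
        vector.extend(details[k - 1][::2 ** k])
    vector.extend(approx[::2 ** levels])
    return vector


# Hypothetical rows: every element of the level-k detail row equals k.
details = [[float(k)] * 256 for k in range(1, 9)]
approx = [9.0] * 256
features = build_feature_vector(details, approx)
```

Under the assumed ordering, the D5 coefficients occupy elements 240 through 247 of the
vector (counting from zero) and the single approximation coefficient is the last element.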

These  wavelet  features  are  used  for  training  and  testing  in  a  Gaussian  classifier,  as  in  the 
rest  of  the  thesis.  During  training,  the  wavelet  exemplar  is  added  to  a  recursively  calculated 
template.  During  testing,  a  256-element  wavelet  exemplar  is  aligned  against  each  class 
template before declaring an assigned class. The Lee analysis is applied using the template
(containing mean and variance information for each element of the wavelet feature vector)
as estimates for the Gaussian classifier. Correctly classified vectors are used to generate the
EDBFM and its eigenvectors.

Figure 16. Correspondence of Feature Vector to Wavelet Scales

5.2.4  Examination  of  the  Energy  Map.  To  extract  the  full  energy  map  described  by 
Chang,  the  wavelet  packet  recursion  is  applied  on  wavelet  feature  vectors  from  the  training 
set. For this analysis, code by Myers (35) is used, again with Daubechies 20-tap filters. As
mentioned in Chapter 2, this is a recursive filtering process applied to all approximations
and all details. In this case, edge effects are handled by mirror-extending the signal being
decomposed and performing circular convolutions. The $l_1$ norm of each leaf in the tree is
stored in the energy map and averaged to form the template.

5.3  Evaluation  of  Significant  Wavelet  Bands 

Wavelet  bands  are  rank  ordered  separately  using  the  energy  map  information  and  the 
EDBFM  information.  Classification  results  using  coefficients  prioritized  by  each  method  are 
generated. 
According to Chang, significant wavelet bands at a given level of decomposition in
the energy map are those with highest calculated energy. If this energy surpasses a certain
threshold, as a percentage of the highest energy band at that level, then that detail is relevant to
the reconstruction of the signal. The idea is applied in this thesis by noting that at each scale,
the approximation coefficients had the most significant $l_1$ norm. This is not surprising because
that is where the DC information is. Also, at each level, the detail coefficients associated with
the conventional wavelet decomposition (those details not derived as a part of the wavelet
packet recursion) had the second most significant bands. The energy of the approximation
or detail level is designated by $E_{a_i}$ or $E_{d_i}$, respectively. $E_{d_i}$ is compared to $E_{a_i}$ and relative
percentages are recorded to rank order the significance of the wavelet coefficients.

Additionally, energy values are compared across scales by normalizing the $l_1$ measure
by a factor of $\sqrt{2}$ for each additional level of decomposition. This is done because Myers’
code does not include the inverse of that term in its filtering routine. As shown in Table 10,
the ranking methods agree. Note that the single approximation coefficient is actually ranked
higher than all the detail levels, with an energy measure of .7433.


Table 10. Wavelet Selection: Entropy

Detail   $E_{d_i}/E_{a_i}$   Rank    Normalized       Rank
Level                        Order   $l_1$ Energy     Order
1        .0782               7       .0828            7
2        .1100               3       .1165            3
3        .0964               5       .1022            5
4        .0999               4       .1057            4
5        .0798               6       .0843            6
6        .0693               8       .0719            8
7        .1700               2       .1498            2
8        .2500               1       .1857            1

Alternatively, discriminantly relevant features according to the EDBFM are predicted.
The EDBFM’s eigenvectors will be interpreted to highlight those wavelet coefficients that are
discriminantly relevant. Since the interpretation of eigenvectors points to individual features,
instead of groups of features, some measure must be applied to select relevant scales. The
criterion is to take the average energies of the portions of the eigenvectors that correspond
to each wavelet band. The most significant eigenvector is used for interpretation. The
correspondence of its elements to the wavelet bands is shown in Figure 17. The prioritization
results are shown in Table 11 and are compared with the results from the entropy-based
analysis. The methods choose different rankings for the scales.
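
A minimal Python sketch of this band-averaging criterion follows. It assumes that
“energy” means the mean squared amplitude of the eigenvector elements within a band and
that the bands are laid out in the order D1, D2, ..., D8, A8; both are assumptions for
illustration, and the example eigenvector is synthetic rather than taken from the UHRR data.

```python
def band_slices(n=256, levels=8):
    """Index ranges of each wavelet band in the feature vector (assumed order)."""
    slices, start = {}, 0
    for k in range(1, levels + 1):
        size = n >> k
        slices[f"D{k}"] = slice(start, start + size)
        start += size
    slices[f"A{levels}"] = slice(start, start + (n >> levels))
    return slices


def rank_bands(eigvec, slices):
    """Rank bands by the average energy of the eigenvector elements they cover."""
    energy = {
        name: sum(v * v for v in eigvec[s]) / (s.stop - s.start)
        for name, s in slices.items()
    }
    order = sorted(energy, key=energy.get, reverse=True)
    return order, energy


# Synthetic eigenvector with its weight concentrated in bands D5 and D6.
eigvec = [0.0] * 256
eigvec[240:248] = [1.0] * 8
eigvec[248:252] = [0.5] * 4
order, energy = rank_bands(eigvec, band_slices())
```

Averaging, rather than summing, keeps large bands such as D1 (128 elements) from
dominating small bands such as D5 (8 elements) purely through their size.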
Figure 17. Correspondence of Eigenvector to Wavelet Bands


Table 11. Wavelet Selection: EDBFM Eigenvector Interpretation

Detail   EDBFM Eigenvector   EDBFM        Entropy
Level    Analysis            Rank Order   Rank Order
1        .0409               5            7
5.4 Classification Using Specified Wavelet Scales

The test sets containing wavelet features are now submitted for reclassification as in the
case of reclassification in the original space of Chapter 4. Results are generated by utilizing
only those scales indicated by Table 11 and are shown in Tables 12 and 13. In the tables, “Dx”
means the detail level x and “Ax” means the approximation level x.


Table 12. Wavelet Classification Results: Entropy

Levels Included                Number of   Classification
                               Features    Accuracy (%)
A8,D8                          2           70.91
A8,D8,D7                       4           91.44
A8,D8,D7,D2                    68          92.57
A8,D8,D7,D2,D4                 84          92.82
A8,D8,D7,D2,D4,D3              116         93.07
A8,D8,D7,D2,D4,D3,D1           244         92.44
A8,D8,D7,D2,D4,D3,D1,D5        252         92.56
A8,D8,D7,D2,D4,D3,D1,D5,D6     256         92.56

Table 13. Wavelet Classification Results: EDBFM Eigenvector Interpretation

Levels Included                Number of   Classification
                               Features    Accuracy (%)
D5                             8           94.33
D5,D6                          12          94.08
D5,D6,D7                       14          93.95
D5,D6,D7,D4                    30          94.45
D5,D6,D7,D4,D1                 158         93.20
D5,D6,D7,D4,D1,D3              190         92.56
D5,D6,D7,D4,D1,D3,D2           254         92.56
D5,D6,D7,D4,D1,D3,D2,D8,A8     256         92.56

The EDBFM ranking shows maximum performance with four detail scales and 30 feature
elements. In fact, using just one scale, D5, one can outperform the classification using the
entire feature set. The results given in the tables show that a definite advantage is gained by
transforming data specifically with respect to classification error.
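
The “Number of Features” columns in Tables 12 and 13 follow from the dyadic band sizes
of a 256-element decomposition. The Python sketch below reproduces those counts from the
band orderings listed in the tables; the helper name is hypothetical and the sketch only
checks the arithmetic, not the classification rates.

```python
from itertools import accumulate

# Coefficients per band after downsampling a 256-element, 8-level decomposition.
BAND_SIZES = {"D1": 128, "D2": 64, "D3": 32, "D4": 16,
              "D5": 8, "D6": 4, "D7": 2, "D8": 1, "A8": 1}


def cumulative_feature_counts(order):
    """Feature count after adding each band in the given priority order."""
    return list(accumulate(BAND_SIZES[band] for band in order))


edbfm_order = ["D5", "D6", "D7", "D4", "D1", "D3", "D2", "D8", "A8"]
entropy_order = ["A8", "D8", "D7", "D2", "D4", "D3", "D1", "D5", "D6"]

edbfm_counts = cumulative_feature_counts(edbfm_order)
entropy_counts = cumulative_feature_counts(entropy_order)
```

The EDBFM ordering reaches its best tabulated accuracy at 30 features (D5, D6, D7, D4),
whereas the entropy ordering needs 116 features before its accuracy peaks.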
5.5 Summary

This  chapter  has  shown  that  relevant  wavelet  scales  with  respect  to  classification  may 
be  chosen  using  decision  boundary  analysis.  These  results  are  compared  against  Chang’s 
method  which  uses  an  entropy  measure  to  determine  relevant  wavelet  scales.  The  entropy 
measure  technique  has  recently  been  updated  by  Coifman  to  be  geared  towards  classification 
error instead of reconstruction error (9). Even though both ranking methods used here produce
classification rates that peak above 90%, these results show the disparity between feature
discrimination techniques. Also, the applicability of the interpretation of EDBFM eigenvectors
to feature reduction in the original (non-transformed) space is verified. Eigenvector
interpretation has selected one wavelet scale containing eight feature elements that produces 94%
accuracy. This accuracy is comparable to, but not as good as, the result obtained in Chapter IV,
where  original  data  produced  a  classification  rate  near  97%  with  only  eight  features. 
VI. Conclusion


6.1  Introduction 

For  UHRR  radar,  the  underlying  problem  of  aligning  exemplars  to  templates  for  training 
and  classification  is  a  formidable  challenge.  In  the  face  of  this  challenge,  this  research  had  a 
two-fold  purpose.  The  initial  portion  of  the  thesis  examined  the  AGC  with  respect  to  properly 
aligning  exemplars  to  class  templates.  A  baseline  of  the  AGC’s  performance  was  sought  to 
assess  the  impact  of  the  alignment  problem.  The  second  line  of  exploration  looked  at  feature 
discrimination  in  the  context  of  the  UHRR  radar  problem.  This  included  applying  alternative 
feature  selection  techniques  and  the  use  of  an  MRA  to  find  discriminantly  relevant  features 
and  thus,  scales. 

The techniques used to accomplish these goals come from several broad theoretical
areas. The first goal relies on the fundamental theories from statistical pattern recognition.
The second goal relies on insights from linear algebra and uses wavelets in an MRA of the
data. The fundamentals and the associated literature search are presented in Chapter II. Chapter II
provides  theory  for  each  of  the  methods  applied  in  Chapter  III  and  the  MRA  used  in  Chapter 
V.  Chapter  IV  presents  the  results  for  AGC  baselining,  including  alignment  analysis,  classifier 
analysis  and  data  analysis.  Also,  feature  selection  with  respect  to  the  decision  boundary  is 
developed.  The  selection  algorithm  is  extended  by  suggesting  a  method  for  using  features  in 
the  original  space  that  does  not  require  linear  combinations  of  the  features.  The  new  method 
selects  specific  features  in  the  original  space  that  are  discriminantly  relevant.  Chapter  V 
applies  similar  feature  selection  analysis  to  choosing  relevant  wavelet  scales. 

6.2  Summary  of  Key  Results 

6.2.1  AGC  Baseline.  An  analysis  of  the  AGC  training  and  testing  algorithms  shows 
how  UHRR  radar  signatures  are  processed  and  classified.  It  is  shown  that  final  alignment 
and  classification  occur  at  the  same  time  in  the  AGC.  Thus,  essentially,  the  same  information 
is being used to make both assessments about a data signature. This analysis appears in
Section  2.5. 

The four class UHRR radar problem is rigorously analyzed. The results show that
alignment has a statistically significant impact on the AGC’s ability to classify data (Section 4.2).
Because alignment is directly associated with the statistical decision for class membership,
the alignment issue will play a role in any classifier applied in the same way. Independent
of alignment, excellent classification rates above 95% are produced with Gaussian and
non-parametric approaches (Section 4.3). The data itself is shown to be reasonably separable,
independent of the alignment issue, because the Bayes bounding analysis gives classification
rates in the 98th percentile for all two class combinations (Section 4.4).

6.2.2 Feature Discrimination. The feature selection technique given by Lee
and Landgrebe is applied successfully to two UHRR radar situations and results achieve
classification rates on the order of 90% for substantially reduced feature combinations
(Section 4.5.1). However, the roll-off in the four class situation highlights that the new feature set
is a linear combination of the original features. This roll-off also demonstrates that redundant
features can detract from a classifier’s performance. The linear dependence problem is solved
and a technique is demonstrated which truly reduces the number of features required for
classification with rates that meet or exceed classification with a full feature set
(Section 4.5.2). It is shown that fewer than 5% of the features in the original feature space may be
used to attain classification rates of over 95% in the two class case, and nearly 90% in the four
class case. This new technique interprets the orientation of the eigenvectors to deduce which
feature elements are most relevant to the classification problem (Section 3.7).

6.2.3  Selection  of  Relevant  Scales  in  an  MRA.  The  new  feature  selection  technique 
is  applied  to  an  MRA  and  successfully  selects  the  discriminantly  relevant  scales  to  produce 
classification  rates  on  the  order  of  those  attained  using  the  raw  UHRR  radar  data.  For  limited 
feature  sets,  it  is  shown  that  its  performance  exceeds  that  of  an  entropy  based  measure  of 
feature  significance. 
6.3 Conclusions


The  results  of  this  thesis  imply  the  following: 

1.  Alignment  plays  a  critical  role  in  the  approach  of  the  AGC  towards  the  UHRR  radar 
problem.  The  AGC  approach  inextricably  intertwines  the  alignment  process  and  the 
classification  decision  making  process.  Independent  of  alignment,  UHRR  radar  data  is 
separable  on  the  order  of  95%  classification  rates,  with  or  without  the  assumption  of 
Gaussian  distributions.  The  starting  point  of  a  signal  should  be  found  independent  of 
its  class  to  improve  performance  of  the  AGC. 

2.  For  UHRR  radar  data,  certain  portions  of  an  aircraft  are  more  discriminantly  relevant 
than  others.  Decision  boundary  analysis  reveals  relevant  features  can  vary  from  problem 
to  problem.  The  eigenvector  interpretation  technique  can  be  used  to  highlight  salient 
airplane  characteristics  between  classes.  Further  research  applying  the  techniques  in 
this  thesis  to  various  combinations  of  classes  and  data  sets  is  required. 

3.  Wavelet  analysis  shows  promise  for  discriminating  UHRR  radar  signatures  based  on 
scale  information.  Classification  rates  above  90%  are  comparable  to  classification  in 
which  the  raw  data  is  used.  Also,  scale  analysis  is  directly  related  to  the  properties 
of  the  target.  Extension  of  the  feature  selection  technique  developed  in  this  thesis  to 
full  wavelet  packet  bases  offers  great  potential.  Shift  invariant  wavelet  bases  may  be 
especially  suitable  to  the  specific  UHRR  radar  problem. 

These  conclusions  show  that  this  thesis  meets  the  objectives  laid  out  in  Section  1.5. 
Feature selection based on the orientation of the decision boundary is applied here
and holds promise for a wide variety of pattern recognition problems. The problem
analyzed here is made more difficult by its high dimensionality and alignment issues.
Application to other, more refined problems will give more insights into the EDBFM techniques
presented and may yield a generalizable theory for reducing the dimensionality of feature
spaces.




Appendix  A.  Demonstration  of  Discrimination  with  the  EDBFM 

This  appendix  provides  two  sample  problems  which  show  the  implementation  of  Lee’s 
algorithm  with  respect  to  feature  discrimination.  Each  of  these  cases  is  a  two  class  problem. 
The  vectors  given  as  bases  for  the  transformed,  EDBFM  space,  are  off  by  180  degrees  from 
Lee’s  results,  but  this  is  due  only  to  the  relative  directional  comparisons  between  classes.  The 
Matlab  code  producing  the  randomized  data  and  results  is  given  in  Appendix  C. 

A.1 Sample Problem 1: 2 Dimensions

The data in this example have the following statistics, where the means are represented
by $M_i$ and covariances are represented by $\Sigma_i$.

$$M_1 = \begin{bmatrix} -1 \\ 1 \end{bmatrix}, \quad M_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \quad \Sigma_1 = \Sigma_2 = \begin{bmatrix} 1.00 & 0.50 \\ 0.50 & 1.00 \end{bmatrix}$$

The data are distributed as two ellipses and are shown in Figure 18. The EDBFM is calculated
by taking the outer product of normal vectors at discrete points along the decision boundary.

Normal vectors are found by finding the intersection of the directed vector connecting
two data points and the equation for the decision boundary. One cycles through each data
point in each class, using only correctly classified samples. The first two iterations of this
process are shown in Figure 19. The number of circles in the diagram is reduced by the
chi-squared test. The dashed line represents the equation for the decision boundary. Again,
as indicated in the text of the thesis, the outer products of all the normal vectors are averaged
to produce the EDBFM. Eigenvalues and eigenvectors of the EDBFM are then taken to yield
information for the transformation and feature reduction.

Figure 18. 2 Dimensional Sample Problem (panel title: All Samples; horizontal axis: Dimension 1)

Figure 19. Calculation of EDBFM (panel title: Nearest Neighboring: Class 1 vs Class 2; horizontal axis: Dimension 1)

In this case the results are:


$$\mathrm{EDBFM} = \begin{bmatrix} .25 & -.25 \\ -.25 & .25 \end{bmatrix}, \quad \Phi = \begin{bmatrix} -.7071 & -.7071 \\ .7071 & -.7071 \end{bmatrix}, \quad \Lambda = \begin{bmatrix} .5000 & .0000 \\ .0000 & -.0000 \end{bmatrix}$$

where $\Phi$ holds the eigenvectors in its columns and $\Lambda$ holds the associated eigenvalues along
its main diagonal. In Figure 20, one may see the directions of relevance by superimposing
the eigenvectors given in the $\Phi$ matrix onto the graph. Figure 20 shows how the decision
boundary is ultimately estimated by connecting data points and finding the intersection with
the boundary equation.
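
The reported eigen-decomposition can be checked by hand or with the short Python sketch
below. It also uses a standard result that is not stated in this appendix: when two Gaussian
classes share one covariance matrix, the decision boundary is linear with a normal parallel
to $\Sigma^{-1}(M_1 - M_2)$. The function names are hypothetical, and the closed-form
2 x 2 eigen-solver is used only to keep the sketch self-contained.

```python
import math


def sym2x2_eig(a, b, d):
    """Eigenpairs of [[a, b], [b, d]], largest eigenvalue first."""
    if b == 0.0:
        pairs = [(a, (1.0, 0.0)), (d, (0.0, 1.0))]
        return pairs if a >= d else pairs[::-1]
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    pairs = []
    for lam in (mean + radius, mean - radius):
        vx, vy = b, lam - a              # solves (A - lam*I) v = 0 when b != 0
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs


def linear_boundary_normal(m1, m2, cov):
    """Unit vector along inv(cov) @ (m1 - m2) for a shared 2 x 2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = m1[0] - m2[0], m1[1] - m2[1]
    nx = (d * dx - b * dy) / det
    ny = (-c * dx + a * dy) / det
    norm = math.hypot(nx, ny)
    return nx / norm, ny / norm


# Values from Sample Problem 1.
(lam1, vec1), (lam2, vec2) = sym2x2_eig(0.25, -0.25, 0.25)
normal = linear_boundary_normal((-1.0, 1.0), (1.0, -1.0), ((1.0, 0.5), (0.5, 1.0)))
```

The dominant eigenvector and the boundary normal both come out as (-.7071, .7071), and the
second eigenvalue is zero, so a single transformed feature carries all of the discriminant
information in this problem.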


Figure  20.  Estimated  Decision  Boundary 
A.2 Sample Problem 2: 3 Dimensions

This section shows the algorithm at work in a three dimensional situation. Plots show the
xy-plane only, so data has been projected onto the paper. Note that this is a convenient analogy
with what happens during the EDBFM transformations. The third dimension is coming out of
the page, but is irrelevant to classification.

The statistics for the data are as follows:

$$M_1 = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \quad M_2 = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \quad \Sigma_1 = \begin{bmatrix} 3.00 & 0.00 & 0.00 \\ 0.00 & 3.00 & 0.00 \\ 0.00 & 0.00 & 1.00 \end{bmatrix}, \quad \Sigma_2 = \begin{bmatrix} 1.00 & 0.00 & 0.00 \\ 0.00 & 1.00 & 0.00 \\ 0.00 & 0.00 & 1.00 \end{bmatrix}$$

In  two  dimensions,  the  data  looks  as  shown  in  Figure  21.  The  other  two  figures  show 
information  in  the  same  way  as  the  previous  sample  problem. 


The resulting computations yield:

$$\mathrm{EDBFM} = \begin{bmatrix} .2272 & -.0323 & .0000 \\ -.0323 & .2728 & .0000 \\ .0000 & .0000 & .0000 \end{bmatrix}$$

$$\Phi = \begin{bmatrix} -.7978 & -.6030 & .0000 \\ .6030 & -.7978 & .0000 \\ .0000 & .0000 & 1.0000 \end{bmatrix}, \quad \Lambda = \begin{bmatrix} .2854 & .0000 & .0000 \\ .0000 & .2146 & .0000 \\ .0000 & .0000 & .0000 \end{bmatrix}$$

Figure 21. 3 Dimensional Sample Problem (xy Plane) (horizontal axis: Dimension 1)

Figure 22. Calculation of EDBFM (horizontal axis: Dimension 1)

Figure 23. Estimated Decision Boundary (xy Plane)


In  this  case  it  is  easy  to  see  that  two  of  the  dimensions  are  required  for  discrimination  in  the 
original  space  or  in  the  transformed  space.  This  sample  problem  shows  that  higher  amplitude 
elements  of  the  significant  eigenvectors  “point  to”  the  key  dimensions  in  the  original  space, 
i.e.  the  first  two  elements  are  “high”  relative  to  the  third  element. 
Appendix B. Alternative Training Methods for the AGC

This appendix provides two alternative training methods which show how classification
rates may be improved.

B.1 Recursive Coarse Alignment of Template

B.1.1  Introduction.  In  the  calculation  of  a  class  template  during  training,  exemplars 
are  finely  aligned  to  the  cumulative  template  and  then  added  into  the  template.  This  fine 
alignment  shifts  the  current  exemplar  away  from  its  true  power  centroid,  as  calculated  by  the 
coarse  alignment  routine.  When  this  off-centroided  signature  is  included  in  the  template,  the 
resultant  class  template  no  longer  has  a  power  centroid  located  at  the  center  bin.  At  the  end  of 
the  training  process,  usually  involving  five  hundred  or  more  training  signatures,  the  resultant 
power  centroid  of  the  mean  template  was  found  to  be  8  to  18  bins  away  from  its  center  bin. 

It  was  conjectured  that  this  offset  is  enough  to  prevent  some  in-class  samples  from  being 
properly  aligned  because  the  centroid  of  the  mean  was  no  longer  centered.  This  “centroid 
drift”  problem  was  corrected  by  recursively  circularly  centroiding  the  mean  template  after 
each  current  exemplar  was  added  in.  The  C  code  of  the  AGC  was  modified  to  accomplish  this 
goal. 
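
The recentering idea may be sketched in Python as follows. This is not the modified C code
of the AGC. The sketch assumes that the circular centroid is the angle of the power-weighted
mean on the unit circle, converted back to a bin index, and that power is the squared
amplitude of the mean template; both are assumptions, and the function names are hypothetical.

```python
import cmath
import math


def circular_centroid(power):
    """Power-weighted circular mean position, in bins (assumed definition)."""
    n = len(power)
    z = sum(p * cmath.exp(2j * math.pi * i / n) for i, p in enumerate(power))
    return (cmath.phase(z) % (2 * math.pi)) * n / (2 * math.pi)


def recenter(template):
    """Circularly shift the template so its power centroid sits at the center bin."""
    n = len(template)
    power = [v * v for v in template]
    shift = round(circular_centroid(power)) - n // 2
    return [template[(i + shift) % n] for i in range(n)]


# Toy template: a single scatterer that has drifted to bin 10 of 64.
template = [0.0] * 64
template[10] = 2.0
centered = recenter(template)
```

Applying such a step after each exemplar is added would keep the centroid of the
cumulative template from drifting away from the center bin during training.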


B.1.2 Results. Results are presented in the following tables for the three cases
indicated. Results were generated by using the same methodology given in Chapter 3.
Comparisons with the other AGC runs in Chapter 4 are made in the last section of this
appendix.

Table 14. AGC: Pristine Data, Recursive Alignment

Actual   Assigned Class                                          Pcc     95% Confidence
Class    9a            aa            f2        fa                (%)     Interval
9a       379.80        50.03         1.46      61.70             77.03   ±1.91
aa       1.55          [illegible]   0.00      [illegible]       99.22   ±0.39
f2       0.55          11.35         277.87    [illegible]       94.19   ±1.10
fa       [illegible]   [illegible]   3.72      [illegible]       91.43   ±1.67

Table 15. AGC: Arbitrary Data, Recursive Alignment

Actual   Assigned Class                                          Pcc     95% Confidence
Class    9a            aa            f2        fa                (%)     Interval
9a       383.47        55.93         1.15      63.43             76.08   ±2.15
aa       2.22          498.50        0.00      3.27              98.91   ±0.53
f2       1.31          21.16         471.62    [illegible]       91.57   ±1.40
fa       7.99          31.21         2.34      462.47            91.76   ±1.39

Table 16. AGC: Train Pristine, Test Arbitrary, Recursive Alignment

Actual   Assigned Class                                          Pcc     95% Confidence
Class    9a            aa            f2        fa                (%)     Interval
9a       379.90        74.01         1.44      61.64             73.48   ±2.05
aa       2.60          500.07        0.00      3.33              98.82   ±0.49
f2       9.41          19.87         641.49    66.22             87.04   ±1.56
fa       8.24          34.92         4.42      463.43            90.68   ±1.54

B.2 Full Correlations

B.2.1 Introduction. As was described in Chapter 2, correlations are restricted to
shifts of ±18 bins during training. (During testing, this limited shift range helped prevent
interclass confusion.) This experiment allows for full correlations during fine alignment in
the training process.

B.2.2 Results. Results are presented in the following tables for the three cases
indicated. Results were generated by using the same methodology given in Chapter 3.
Comparisons with the other AGC runs in Chapter 4 are made in the last section of this
appendix.


Table 17. AGC: Pristine Data, Fully Correlated Alignment

Actual   Assigned Class                          Pcc     95% Confidence
Class    9a        aa        f2        fa        (%)     Interval
9a       380.52    50.21     0.87      61.40     77.19   ±1.92
aa       1.57      500.11    0.00      2.32      99.22   ±0.40
f2       .818      11.47     277.05    5.65      93.91   ±1.09
fa       8.29      31.27     2.62      456.81    91.55   ±1.37

Table 18. AGC: Arbitrary Data, Fully Correlated Alignment

Actual   Assigned Class                          Pcc     95% Confidence
Class    9a        aa        f2        fa        (%)     Interval
9a       386.70    53.658    0.55      63.58     76.73   ±2.11
aa       2.16      498.63    0.00      3.20      98.93   ±0.51
f2       1.49      21.41     466.89    25.20     90.66   ±1.45
fa       8.07      31.26     1.61      463.06    91.87   ±1.52

B.3  Leave  One  Out  Results 

Leave  one  out  results  were  generated  for  a  single  ordering  of  the  data.  That  is,  the  order 
of  presentation  to  the  training  algorithm  remained  the  same  throughout.  These  results  are 
consistent  with  the  results  attained  in  the  overall  runs.  The  original  AGC  technique  (without 
the  recursive  alignment  or  full  correlation  options)  was  used  in  the  pristine  and  arbitrary  cases. 
The  pristine  case  resulted  in  a  92.29%  classification  rate  and  the  arbitrary  case  resulted  in  a 
89.45% classification rate. As predicted by Devijver, the hold out runs done earlier in the
thesis provide pessimistic estimates of the error rates. They predict higher error rates than
would be expected. Part of this is due to the problem of using test sets with some common
test vectors over the trials.


B.4  Summary 

This section compares results presented in this appendix with the original runs. The
“mixed” case in Table 19 refers to training on pristine data and testing on arbitrary data. See
Chapter 3 for full definitions and methodology.

Some observations are worth pointing out. First, there is a general improvement in
performance with respect to the arbitrary data. Second, in the mixed case, there is a definite
jump in the classification rate, showing that the previous inability to learn the alignment
problems is corrected. Third, relative class performance generally remains the same except
that the full correlation runs associated with class f2 show marked improvement.


Table 19. Comparison of Results

Case                           Pcc     97.5% Confidence
                               (%)     Interval (%)
Pristine (AGC)                 90.16   ±1.31
Arbitrary (AGC)                87.13   ±1.59
Pristine (Recursive)           90.12   ±1.35
Arbitrary (Recursive)          89.59   ±1.54
Pristine (Full Correlation)    90.14   ±1.37
Arbitrary (Full Correlation)   89.56   ±1.53
Mixed (AGC)                    82.60   ±3.32
Mixed (Recursive)              87.40   ±1.54
Appendix C. Computer Code


This appendix provides Matlab code which may be used to implement Lee’s EDBFM
technique in a 2 and 4 class problem.

C.1 Sample Problem Code

This  section  contains  code  to  produce  the  sample  problems  found  in  Appendix  A.  It 
should  run  without  modification.  All  plot  commands  have  been  removed. 
% This program demonstrates the Lee and Landgrebe algorithm,
% as presented in their article.
clear

% Set Environment Parameters

% Choose number of samples per class
numsigs = 100;

% Choose which sample problem to run
option = 3

% Set Statistics for each option
if option==1
  dimen=2;
  rand('seed',0);
  rc1 = randn(numsigs,dimen);
  rand('seed',2);
  rc2 = randn(numsigs,dimen);
  m1 = [-1 1]';
  m2 = [1 -1]';
  v1 = [1 .5;.5 1];
  v2 = [1 .5;.5 1];
end

if option==2
  dimen=2;
  rand('seed',0);
  rc1 = randn(numsigs,dimen);
  rand('seed',2);
  rc2 = randn(numsigs,dimen);
  m1 = [.05 0]';
  m2 = [-.05 0]';
  v1 = [3 0;0 3];
  v2 = [3 0;0 1];
end

if option==3
  dimen=3;
  rand('seed',0);
  rc1 = randn(numsigs,dimen);
  rand('seed',2);
  rc2 = randn(numsigs,dimen);
  m1 = [0 0 0]';
  m2 = [0 0 0]';
  v1 = [3 0 0;0 3 0;0 0 1];
  v2 = [1 0 0;0 1 0;0 0 1];
end

if option==4
  dimen=4;
  rand('seed',0);
  rc1 = randn(numsigs,dimen);
  rand('seed',2);
  rc2 = randn(numsigs,dimen);
  m1 = [-1 1 -1 1]';
  m2 = [-1 1 -1 1]';
  v1 = [6 0 0 0;0 6 0 0;0 0 1 0;0 0 0 1];
  v2 = [1 0 0 0;0 1 0 0;0 0 1 0;0 0 0 1];
end

% Calculate information for colorizing the noise
[V1,d1] = eig(v1);
[V2,d2] = eig(v2);
lambda1 = sqrt(d1);
lambda2 = sqrt(d2);

p1 = .5;
p2 = .5;
cov(rc1);
cov(rc2);

% Colorize the white noise
if dimen==2
  rc1color = V1*lambda1*rc1' + ...
    [m1(1)*ones(1,numsigs);m1(2)*ones(1,numsigs)];
  rc2color = V2*lambda2*rc2' + ...
    [m2(1)*ones(1,numsigs);m2(2)*ones(1,numsigs)];
elseif dimen==3
  rc1color = V1*lambda1*rc1' + ...
    [m1(1)*ones(1,numsigs);m1(2)*ones(1,numsigs);m1(3)*ones(1,numsigs)];
  rc2color = V2*lambda2*rc2' + ...
    [m2(1)*ones(1,numsigs);m2(2)*ones(1,numsigs);m2(3)*ones(1,numsigs)];
elseif dimen==4
  rc1color = V1*lambda1*rc1' + ...
    [m1(1)*ones(1,numsigs);m1(2)*ones(1,numsigs);m1(3)*ones(1,numsigs);m1(4)*ones(1,numsigs)];
  rc2color = V2*lambda2*rc2' + ...
    [m2(1)*ones(1,numsigs);m2(2)*ones(1,numsigs);m2(3)*ones(1,numsigs);m2(4)*ones(1,numsigs)];
end

% Compute means and covariances
sig1 = cov(rc1color');
sig2 = cov(rc2color');

% Overwrite the estimates with the actual covariances
sig1 = v1;
sig2 = v2;

detsig1 = det(sig1);
detsig2 = det(sig2);

invsig1 = inv(sig1);
invsig2 = inv(sig2);

M1 = mean(rc1color');
M2 = mean(rc2color');

% Overwrite the estimates with the actual means
M1 = m1';
M2 = m2';

c1 = rc1color';
c2 = rc2color';

% Initialize mahalanobis distances
% and classification matrices
md11 = zeros(numsigs,1);
md12 = zeros(numsigs,1);
md21 = zeros(numsigs,1);
md22 = zeros(numsigs,1);

d11 = zeros(numsigs,1);
d12 = zeros(numsigs,1);
d21 = zeros(numsigs,1);
d22 = zeros(numsigs,1);

% Classify the data using a single gaussian classifier
for i = 1:numsigs
  md11(i) = (c1(i,:)-M1)*invsig1*(c1(i,:)-M1)';
  md12(i) = (c2(i,:)-M1)*invsig1*(c2(i,:)-M1)';
  md21(i) = (c1(i,:)-M2)*invsig2*(c1(i,:)-M2)';
  md22(i) = (c2(i,:)-M2)*invsig2*(c2(i,:)-M2)';

  d11(i) = .5*md11(i) + .5*log(detsig1);
  d12(i) = .5*md12(i) + .5*log(detsig1);
  d21(i) = .5*md21(i) + .5*log(detsig2);
  d22(i) = .5*md22(i) + .5*log(detsig2);
end

b1 = .5*log(detsig1);
b2 = .5*log(detsig2);

correct1 = 0;
correct2 = 0;

% Classify the data and sort the data into
% good and bad groups
% The good group is retained for later analysis
wch1 = [];wchbad1=[];
wch2 = [];wchbad2=[];

for j=1:numsigs
  if d11(j) < d21(j)
    correct1=correct1+1;
    wch1 = [wch1; j];
  else
    wchbad1 = [wchbad1; j];
  end

  if d22(j) < d12(j)
    correct2=correct2+1;
    wch2 = [wch2; j];
  else
    wchbad2 = [wchbad2; j];
  end

  % Retain distance and classification information
  % for correctly classified samples
  md11t = md11(wch1,:);
  md12t = md12(wch2,:);
  md21t = md21(wch1,:);
  md22t = md22(wch2,:);

  d11t = d11(wch1,:);
  d12t = d12(wch2,:);
  d21t = d21(wch1,:);
  d22t = d22(wch2,:);

  true1=c1(wch1,:);
  true2=c2(wch2,:);
  bad1=c1(wchbad1,:);
  bad2=c2(wchbad2,:);
end

% Perform chi test to eliminate outliers
% The following is taken from
% Arnold O. Allen, pp. 137-138, 625
stat = 'performing chi-square test'
n = 2;

zalpha = 1.6449;
chialpha = 5.9915;    % 95% point, 2 degrees of freedom

if dimen==3
  chialpha = 7.815;   % 95% point, 3 degrees of freedom
end

if dimen==4
  chialpha = 9.4877;  % 95% point, 4 degrees of freedom
end

% Apply in-class chi-test, retaining relevant data points
% Note here one could have just as easily used mdxxx and
% compared it directly to chialpha
stat='in-class chi-square'

rel1 = true1(chitesttoy(d11t,chialpha+b1),:);
rel2 = true2(chitesttoy(d22t,chialpha+b2),:);

% Apply chi test to other classes
stat='interclass chi-square'

Lmin = 5;
chialpha2 = chialpha;

rel12 = true2(Ltesttoy(d12t,chialpha2+b1,Lmin),:);
rel21 = true1(Ltesttoy(d21t,chialpha2+b2,Lmin),:);

% Calculate EDBFM
% Begin by reorganizing data
stat = 'calculate N and EDBFM'
m1=m1';
m2=m2';

% NOTE that here we use ACTUAL (non-estimated) values
% for the pdf parameters
P = [];
P1 = [];
P2 = [];

% Loop on each class
for i = 1:2
  eval(['cofi = rel' int2str(i) ';'])
  eval(['cmref = m' int2str(i) ';'])
  eval(['cvref = v' int2str(i) ';'])

  numcofi = size(cofi,1);

  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  % Compare to other class
  for j=1:2
    if j~=i
      EDBFM = zeros(dimen,dimen);
      N = zeros(1,numcofi);

      eval(['othc = rel' int2str(i) int2str(j) ';'])
      eval(['cmother = m' int2str(j) ';'])
      eval(['cvother = v' int2str(j) ';'])

      for k = 1:numcofi
        closest = findnearesttest(cofi(k,:),othc,cvref);
        [N,p] = computeNPtest(cofi(k,:),cmref,cvref,closest,cmother,cvother,.5,.5,i);
        P = [P p];

        if i==1
          P1 = [P1 p];
        elseif i==2
          P2 = [P2 p];
        end

        EDBFM = EDBFM + (1/numcofi)*(N*N');
      end

      eval(['E' num2str(i) num2str(j) ' = EDBFM;'])
    end
  end
end

% Now, combine each into the final, averaged EDBFM
EDBFM = zeros(dimen,dimen);

p1 = .5;
p2 = .5;

for i=1:2
  for j=1:2
    if j~=i
      eval(['EDBFM = EDBFM + p1*p2*E' num2str(i) num2str(j) ';'])
    end
  end
end

% Calculate eigenvalues and eigenvectors for future use
[V,d] = eig(EDBFM)

rankdbfm = rank(EDBFM)

% Chitesttoy subroutine
function [goodonly] = chitesttoy(x,chialpha)

goodonly = [];

for i = 1:size(x,1)
  if x(i) < chialpha
    goodonly = [goodonly; i];
  end
end

% Ltesttoy subroutine
function [goodonly] = Ltesttoy(x,chialpha,Lmi)

goodonly = [];

for i = 1:size(x,1)
  size(x,1);
  if x(i) < chialpha
    goodonly = [goodonly; i];
  end
end

chk = size(goodonly,1);

% If too few samples pass, keep the Lmi closest ones instead
if chk < Lmi
  [dum wch] = sort(x);
  goodonly = wch(1:Lmi);
end %if

% findnearesttest subroutine
function [winner] = findnearesttest(reference,testsigs,var)

[numsigs,feats] = size(testsigs);
bestmatch = 0;

dists = zeros(numsigs,1);

var=var';

for i = 1:numsigs
  i;
  xtrial = testsigs(i,:);
  [lx,wx] = size(xtrial);
  [lref,wref] = size(reference);

  dists(i) = (xtrial-reference)*(inv(var))*(xtrial-reference)';
end

[Y,maxind] = min(dists);
winner = testsigs(maxind,:);

% computeNPtest subroutine
% Implement equations found in Lee & Landgrebe article
% Note: orient is used to ensure that the vector
% connecting the two sample points is pointed the same
% way when you switch from comparing class 1 to class 2
% and vice-versa
function [N,P] = computeNPtest(x1,m1,v1,x2,m2,v2,prob1,prob2,orient)

x1;x2;m1;v1;m2;v2;
x1 = x1';x2 = x2';
m1 = m1';m2 = m2';

v0 = x1;
v = x2-x1;

det1 = det(v1);
det2 = det(v2);

inv1 = inv(v1);
inv2 = inv(v2);

% Coefficients of h(v0 + u*v) = a*u^2 + b*u + cprime along the line
% joining the two samples
c = .5 * (m1'*inv1*m1 - m2'*inv2*m2) + .5*log(det1/det2);
cprime = .5*v0' * (inv1-inv2) * v0 - (m1'*inv1-m2'*inv2)*v0 + c;

b = v0'*(inv1-inv2)*v - (m1'*inv1-m2'*inv2)*v;
a = .5 * v' * (inv1-inv2) * v;

t = log(prob1/prob2);

if a == 0
  u = (t-cprime)/b;
else
  u1 = (-b + sqrt(b^2-4*a*(cprime-t)))/(2*a);
  u2 = (-b - sqrt(b^2-4*a*(cprime-t)))/(2*a);
  if orient==1
    u = min([u1 u2]);
  else
    u = max([u1 u2]);
  end
end

% Point on the decision boundary
p = u*v+v0;

% Gradient of h at p, normal to the decision boundary
N = (inv1-inv2)*p + (inv2*m2 - inv1*m1);

P=p;

N = N/sqrt(N'*N);
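
The colorizing step in the sample-problem listing can be checked in isolation. The following is a minimal sketch, assuming the covariance and mean of sample problem 1; the names A, w, x and err are illustrative and do not appear in the thesis code. It shows that the coloring matrix built from the eigendecomposition reproduces the target covariance, while the sample covariance of a finite draw only approximates it.

```matlab
% Sketch: color white Gaussian noise to a target mean and covariance.
% v and m are copied from sample problem 1; all other names are illustrative.
v = [1 .5; .5 1];          % target covariance
m = [-1 1]';               % target mean
numsigs = 100;             % samples per class

[V, d] = eig(v);           % v = V*d*V' for a symmetric matrix
A = V*sqrt(d);             % coloring matrix, so that A*A' = v

w = randn(numsigs, 2);                 % white noise, one sample per row
x = (A*w' + m*ones(1, numsigs))';      % colored samples, one per row

% The coloring matrix reproduces the target covariance to machine precision;
% the sample covariance cov(x) only approaches v as numsigs grows.
err = max(max(abs(A*A' - v)));
```

Comparing cov(x) with v for increasing numsigs gives a quick sense of how much of the estimated decision boundary is due to sampling noise, which is presumably why the listing substitutes the actual parameters for the estimates.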
C.2 4-Class Code

This section contains code used in the four-class problems of Chapter 4. The first section finds the EDBFM and its eigenvectors and eigenvalues. The second section gives code for reclassification in the transformed space. The third section gives code for reclassification in the original space. Associated subroutines may be found in the final section of this appendix. Note that the programs are designed specifically for the UHRR radar problem (because of alignment). It should not be difficult to generalize the concepts to other problems or to simplify the code to a two-class problem.
C.2.1 EDBFM Analysis.

% SET NUMBER OF CLASSES AND NUMBER OF FEATURES USED
% Class id's correspond to 1 = 9a
%                          2 = aa
%                          3 = f2
%                          4 = fa
feats = 256
classes=4
n = classes;

troubleshoot = 0;

load trialp256
load paramsprs256

% This section uses a subroutine to find the correctly
% classified sample from the UHRR training and test sets
% The files w/ tst prefixes contain test data
if 1==0
  load tst9a256
  load tstaa256
  load tstf2256
  load tstfa256

  sigs = [tst9a256; tstaa256; tstf2256; tstfa256];

  % Single out correctly classified samples only and store
  % The following variables hold correctly classified samples (cxxg)
  % and the respective test statistics (dxxg) as computed by the AGC
  stat = 'finding correctly classified data'

  [c9ag caag cf2g cfag d9ag daag df2g dfag maxindices] = ...
    findcorrect(sigs,cum_name,cum_dis,4);

  save paramsprs256 c9ag caag cf2g cfag d9ag daag df2g dfag maxindices
end


% FOR TROUBLESHOOTING, LIMIT THE NUMBER OF SIGNALS CONSIDERED
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

if troubleshoot == 1
  howmany = 100;
  c9a = c9ag(1:howmany,:);
  caa = caag(1:howmany,:);
  cf2 = cf2g(1:howmany,:);
  cfa = cfag(1:howmany,:);

  d9a = d9ag(1:howmany,:);
  daa = daag(1:howmany,:);
  df2 = df2g(1:howmany,:);
  dfa = dfag(1:howmany,:);
else
  c9a = c9ag;
  caa = caag;
  cf2 = cf2g;
  cfa = cfag;

  d9a = d9ag;
  daa = daag;
  df2 = df2g;
  dfa = dfag;
end

% Eliminate Outliers of each class with Chi-Square Test

% Find biases and subtract out to get true Mahalanobis distance
stat = 'finding biases'

% IMPORTANT: PRELIMINARY RESULTS INDICATE YOU MUST
% RETAIN THE BIAS IN THE DISTANCE CALCULATION BECAUSE
% YOU MAY OTHERWISE ENCOUNTER TWO POINTS NOT SEPARATED
% BY THE h(X). RECALL THAT POINTS MUST STRADDLE
% THE DECISION BOUNDARY OR NO INTERSECTION CAN BE FOUND!
% (BIAS IS THE DETERMINANT OF THE CLASS COVARIANCE)
% prsxxtemp is a variable representing the respective
% class of pristine data

[b9a m9a v9a] = findbias('prs9atemp');
[baa maa vaa] = findbias('prsaatemp');
[bf2 mf2 vf2] = findbias('prsf2temp');
[bfa mfa vfa] = findbias('prsfatemp');

d9at = zeros(size(d9ag));
daat = zeros(size(daag));
df2t = zeros(size(df2g));
dfat = zeros(size(dfag));

% HERE, RETAIN ORIGINAL CLASSIFICATION DISTANCE:
d9at = d9a;
daat = daa;
df2t = df2;
dfat = dfa;

stat = 'performing chi-square test'

% The following info is taken from
% Arnold O. Allen, Prob, Stats, and Queuing
% theory pp 137-138; and p. 625, table 4

n = feats;          % the # of degrees of freedom
zalpha = 1.6449;    % 95% inclusion
%zalpha = 1.96;     % 97.5% inclusion
%zalpha = 2.3263    % 99% inclusion
%zalpha = -1.6449;  % 5% inclusion
%zalpha = -1.96;    % 2.5% inclusion

% The test statistic with n > 100:
chialpha = n*(1-2*(9*n)^(-1)+zalpha*sqrt(2*(9*n)^(-1)))^(3)

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Apply in-class chi-test, retaining relevant data points
stat='in-class chi-square'

rel1 = c9a(chitest(d9at,chialpha+b9a,1),:);
rel2 = caa(chitest(daat,chialpha+baa,2),:);
rel3 = cf2(chitest(df2t,chialpha+bf2,3),:);
rel4 = cfa(chitest(dfat,chialpha+bfa,4),:);

% Apply "Chi-Square" test to other classes
stat='interclass chi-square'

Lmin = 50;
chialpha2 = chialpha;

rel12 = caa(Ltest(daat,chialpha2+b9a,1,Lmin),:);
rel13 = cf2(Ltest(df2t,chialpha2+b9a,1,Lmin),:);
rel14 = cfa(Ltest(dfat,chialpha2+b9a,1,Lmin),:);

rel21 = c9a(Ltest(d9at,chialpha2+baa,2,Lmin),:);
rel23 = cf2(Ltest(df2t,chialpha2+baa,2,Lmin),:);
rel24 = cfa(Ltest(dfat,chialpha2+baa,2,Lmin),:);

rel31 = c9a(Ltest(d9at,chialpha2+bf2,3,Lmin),:);
rel32 = caa(Ltest(daat,chialpha2+bf2,3,Lmin),:);
rel34 = cfa(Ltest(dfat,chialpha2+bf2,3,Lmin),:);

rel41 = c9a(Ltest(d9at,chialpha2+bfa,4,Lmin),:);
rel42 = caa(Ltest(daat,chialpha2+bfa,4,Lmin),:);
rel43 = cf2(Ltest(df2t,chialpha2+bfa,4,Lmin),:);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% The xxxxxtemp files hold template information (mean
% and variances) for each class
stat = 'loading mean and variance info'

%load prs9atemp
m1 = m9a;
v1 = v9a;

%load prsaatemp
m2 = maa;
v2 = vaa;

%load prsf2temp
m3 = mf2;
v3 = vf2;

%load prsfatemp
m4 = mfa;
v4 = vfa;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% OK, keep your fingers crossed
stat = 'calculate N and EDBFM '
numberofimag = 0;   % count of complex normals encountered
for i=1:4
  eval(['cofi = rel' int2str(i) ';'])
  eval(['cmref = m' int2str(i) ';'])
  eval(['cvref = v' int2str(i) ';'])

  numcofi = size(cofi,1)

  for j=1:4
    [i j]
    if j~=i
      EDBFM = zeros(n,n);

      eval(['othc = rel' int2str(i) int2str(j) ';'])
      eval(['cmother = m' int2str(j) ';'])
      eval(['cvother = v' int2str(j) ';'])

      % The following scaling trick is required to prevent machine
      % precision problems. The factor of 100 divides out in the
      % computeNP subroutine
      detref = det(diag(100*cvref));
      detother = det(diag(100*cvother));

      invref = inv(diag(cvref));
      invother = inv(diag(cvother));

      for k = 1:numcofi
        [k numcofi]

        closest = findnearest(cofi(k,:),othc,cvref);

        % orient keeps the connecting vector pointed the same way
        % for the pairs (i,j) and (j,i)
        if i < j
          orient=1;
        elseif i > j
          orient=2;
        end %if

        [N,p] = computeNP(cofi(k,:),cmref,cvref,closest,cmother, ...
          cvother,detref,detother,invref,invother,.5,.5,orient);

        testN = sum(N);

        % One should only get imaginary numbers if an incorrectly classified
        % sample slipped past the guards:
        if any(imag(N)~=0)
          'Uh-oh, N is complex'
          numberofimag = numberofimag+1;
        else
          EDBFM = EDBFM + (1/numcofi)*(N*N');
        end
      end %for k

      eval(['E' num2str(i) num2str(j) ' = EDBFM;'])
    end %if j~=i
  end %for j
end %for i

stat = 'calculate EDBFM '

% Calculate EDBFM
EDBFM = zeros(n,n);

% Assume aprioris are equal
p1 = 1/classes;
p2 = 1/classes;

for i = 1:4
  for j=1:4
    if j~=i
      eval(['EDBFM = EDBFM + p1*p2*E' num2str(i) num2str(j) ';'])
    end %if j~=i
  end %for j
end %for i

[V,d] = eig(EDBFM);

rankdbfm = rank(EDBFM)

save results256full EDBFM rankdbfm d V numberofimag
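
The chi-square threshold in the listing above has the form of the Wilson-Hilferty cube approximation to the upper chi-square point. The sketch below evaluates it separately from the rest of the program; chi2approx is a hypothetical helper name introduced here for illustration. Comparing the four-degree-of-freedom value with the tabulated 9.4877 used in the sample problems gives a rough idea of the approximation error at small n.

```matlab
% Sketch: Wilson-Hilferty approximation to the upper chi-square point.
% chi2approx is a hypothetical helper name, not part of the thesis code.
chi2approx = @(n, z) n .* (1 - 2./(9*n) + z .* sqrt(2./(9*n))).^3;

zalpha = 1.6449;                    % 95% point of the standard normal
chi256 = chi2approx(256, zalpha);   % threshold for 256 features
chi4   = chi2approx(4, zalpha);     % compare with the tabulated 9.4877
```

For four degrees of freedom the approximation falls within a few hundredths of the tabulated value, and it becomes more accurate as n grows, which is consistent with its use for n > 100 in the listing.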
C.2.2 Reclassification in Transformed Space.

% This program performs extraction and testing
% for a four class problem in the transformed feature space
clear

% Load in training and testing data sets
train = 1;
transform = 1;

load trnfa256
load trnf2256
load trn9a256
load trnaa256

load tst9a256
load tstaa256
load tstf2256
load tstfa256

% Load in EDBFM information
load results256full

results = zeros(32,4);
aa = 1;

% kk is the looping parameter. For the results in the thesis,
% I generated them using the following numbers of features:
% 1, 3, 5, 7, 9, ..., 19, 21, 50, 100, 200, 250, 256
for kk = [1:2:21 50:50:250 256]

  correct1 = 0;
  correct2 = 0;
  correct3 = 0;
  correct4 = 0;

  aa

  if train == 1
    trnc1 = trn9a256(:,:);
    trnc2 = trnaa256(:,:);
    trnc3 = trnf2256(:,:);
    trnc4 = trnfa256(:,:);

    % TRN set has been previously aligned

    % Select how many relevant eigenvectors
    % you wish to use
    relnum = kk;

    % Recall that V holds the eigenvectors of the EDBFM, ordered
    % by magnitude of its eigenvalue. This step selects
    % the number to use.
    Vt = V(:,256-relnum+1:256);

    % During troubleshooting, one may elect not to transform
    % at all
    if transform == 0
      Vt = diag(ones(1,256));
    end

    % Perform transformation, based on the EDBFM
    'transforming training data'

    trnc1 = trnc1 * Vt;
    trnc2 = trnc2 * Vt;
    trnc3 = trnc3 * Vt;
    trnc4 = trnc4 * Vt;

    % Implement the training portion of a Gaussian classifier,
    % recursively computing mean and variances.
    % Note that the data sets were previously aligned before
    % extraction from the AGC algorithm

    'training 9a'
    mx9a = trnc1(1,:);
    vx9a = zeros(1,size(trnc1,2));
    for i = 2:size(trnc1,1)
      currx = trnc1(i,:);
      mx9a = ((i-1)*mx9a+currx)/i;
      vx9a = (((i-1)*vx9a)/i)+((mx9a-currx).^2)/(i-1);
    end

    'training aa'
    mxaa = trnc2(1,:);
    vxaa = zeros(1,size(trnc2,2));
    for i = 2:size(trnc2,1)
      currx = trnc2(i,:);
      mxaa = ((i-1)*mxaa+currx)/i;
      vxaa = (((i-1)*vxaa)/i) + ((mxaa-currx).^2)/(i-1);
    end

    'training f2'
    mxf2 = trnc3(1,:);
    vxf2 = zeros(1,size(trnc3,2));
    for i = 2:size(trnc3,1)
      currx = trnc3(i,:);
      mxf2 = ((i-1)*mxf2+currx)/i;
      vxf2 = (((i-1)*vxf2)/i) + ((mxf2-currx).^2)/(i-1);
    end

    'training fa'
    mxfa = trnc4(1,:);
    vxfa = zeros(1,size(trnc4,2));
    for i = 2:size(trnc4,1)
      currx = trnc4(i,:);
      mxfa = ((i-1)*mxfa+currx)/i;
      vxfa = (((i-1)*vxfa)/i) + ((mxfa-currx).^2)/(i-1);
    end

    % Calculate bias terms:
    b9a = sum(log(vx9a));
    baa = sum(log(vxaa));
    bf2 = sum(log(vxf2));
    bfa = sum(log(vxfa));

  end %if train


  c1test = tst9a256;
  c2test = tstaa256;
  c3test = tstf2256;
  c4test = tstfa256;

  % Transform test data, as above
  % Perform transformation, based on the EDBFM
  % 'transforming test data'
  c1test = c1test * Vt;
  c2test = c2test * Vt;
  c3test = c3test * Vt;
  c4test = c4test * Vt;

  % In all of the following ''dxy'' implies one is comparing
  % the xth class mean with an exemplar from class y. Note that
  % 1-4 corresponds to 9a-fa as shown at the beginning of the EDBFM
  % analysis program

  % 'testing 9a'
  for i = 1:size(c1test,1)
    [i size(c1test,1)];

    xt = c1test(i,:);

    % Give this subroutine the current exemplar,
    % the mean for the respective class, the
    % variance for the respective class, the
    % bias for the respective class, and the
    % number of features used.
    d11 = decisionwshiftkk(xt,mx9a,vx9a,b9a,kk);
    d21 = decisionwshiftkk(xt,mxaa,vxaa,baa,kk);
    d31 = decisionwshiftkk(xt,mxf2,vxf2,bf2,kk);
    d41 = decisionwshiftkk(xt,mxfa,vxfa,bfa,kk);

    % Now classify the data, saving relevant information for later
    [dum,idx] = min([d11 d21 d31 d41]);
    if idx == 1
      correct1 = correct1 + 1;
    end

    correct1;
  end

  % 'testing aa'
  for i = 1:size(c2test,1)
    [i size(c2test,1)];

    xt = c2test(i,:);

    d11 = decisionwshiftkk(xt,mx9a,vx9a,b9a,kk);
    d21 = decisionwshiftkk(xt,mxaa,vxaa,baa,kk);
    d31 = decisionwshiftkk(xt,mxf2,vxf2,bf2,kk);
    d41 = decisionwshiftkk(xt,mxfa,vxfa,bfa,kk);

    % Now classify the data, saving relevant information for later
    [dum,idx] = min([d11 d21 d31 d41]);
    if idx == 2
      correct2 = correct2 + 1;
    end

    correct2;
  end

  % 'testing f2'
  for i = 1:size(c3test,1)
    [i size(c3test,1)];

    xt = c3test(i,:);

    d11 = decisionwshiftkk(xt,mx9a,vx9a,b9a,kk);
    d21 = decisionwshiftkk(xt,mxaa,vxaa,baa,kk);
    d31 = decisionwshiftkk(xt,mxf2,vxf2,bf2,kk);
    d41 = decisionwshiftkk(xt,mxfa,vxfa,bfa,kk);

    % Now classify the data, saving relevant information for later
    [dum,idx] = min([d11 d21 d31 d41]);
    if idx == 3
      correct3 = correct3 + 1;
    end

    correct3;
  end

  % 'testing fa'
  for i = 1:size(c4test,1)
    [i size(c4test,1)];

    xt = c4test(i,:);

    d11 = decisionwshiftkk(xt,mx9a,vx9a,b9a,kk);
    d21 = decisionwshiftkk(xt,mxaa,vxaa,baa,kk);
    d31 = decisionwshiftkk(xt,mxf2,vxf2,bf2,kk);
    d41 = decisionwshiftkk(xt,mxfa,vxfa,bfa,kk);

    % Now classify the data, saving relevant information for later
    [dum,idx] = min([d11 d21 d31 d41]);
    if idx == 4
      correct4 = correct4 + 1;
    end

    correct4;
  end


  results(aa,1) = correct1;
  results(aa,2) = correct2;
  results(aa,3) = correct3;
  results(aa,4) = correct4;

  aa = aa+1;

  if aa == 1
    results(1,:)
  end

end %for kk

save tran4rateswshifts results
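
The training loops above update the mean and variance one exemplar at a time. The following sketch, run on a small made-up matrix X rather than the UHRR data, suggests that this recursion reproduces the batch mean and the variance normalized by N (not N-1), provided the mean is updated before the variance. The names X, errm and errv are illustrative.

```matlab
% Sketch: running mean and variance, one exemplar at a time.
% X is made-up data (rows are exemplars); it is not the UHRR data.
X = [1 2; 3 5; 4 4; 8 1; 6 7];

mx = X(1, :);                    % running mean
vx = zeros(1, size(X, 2));       % running variance
for i = 2:size(X, 1)
    currx = X(i, :);
    mx = ((i-1)*mx + currx)/i;                      % update the mean first
    vx = ((i-1)*vx)/i + ((mx - currx).^2)/(i-1);    % uses the updated mean
end

% Compare with the batch mean and the N-normalized batch variance
errm = max(abs(mx - mean(X)));
errv = max(abs(vx - var(X, 1)));
```

The N normalization matters only for small training sets; with several hundred exemplars per class the difference from the N-1 estimate is negligible.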
C.2.3 Reclassification in Original Space.

% This program performs extraction and testing
% for a four class problem in the original feature space
clear

% Load in training and testing data sets
train = 1;
transform = 1;

load trnfa256
load trnf2256
load trn9a256
load trnaa256

load tst9a256
load tstaa256
load tstf2256
load tstfa256

% Load in EDBFM information
load results256full

results = zeros(32,4);
aa = 1;

% kk is the looping parameter. For the results in the thesis,
% I generated them using the following numbers of features:
% 1, 3, 5, 7, 9, ..., 19, 21, 50, 100, 200, 250, 256
for kk = [1:2:21 50:50:250 256]

  correct1 = 0;
  correct2 = 0;
  correct3 = 0;
  correct4 = 0;

  aa

  % Evaluate V to choose significant features
  % Note, use an average of the ten most significant
  % eigenvectors
  useus = abs(V(:,247:256)');
  useus = mean(useus);

  featselect = [];
  for numfeats = 1:kk
    [wch,idx] = max(useus);
    useus(idx) = -100;
    featselect = [featselect idx];
  end

  if train == 1
    trnc1 = trn9a256(:,featselect);
    trnc2 = trnaa256(:,featselect);
    trnc3 = trnf2256(:,featselect);
    trnc4 = trnfa256(:,featselect);

    % TRN set has been previously aligned

    % Select how many relevant eigenvectors
    % you wish to use
    relnum = kk;

    % Implement the training portion of a Gaussian classifier,
    % recursively computing mean and variances.
    % Note that the data sets were previously aligned before
    % extraction from the AGC algorithm

    'training 9a'
    mx9a = trnc1(1,:);
    vx9a = zeros(1,size(trnc1,2));
    for i = 2:size(trnc1,1)
      currx = trnc1(i,:);
      mx9a = ((i-1)*mx9a+currx)/i;
      vx9a = (((i-1)*vx9a)/i) + ((mx9a-currx).^2)/(i-1);
    end

    'training aa'
    mxaa = trnc2(1,:);
    vxaa = zeros(1,size(trnc2,2));
    for i = 2:size(trnc2,1)
      currx = trnc2(i,:);
      mxaa = ((i-1)*mxaa+currx)/i;
      vxaa = (((i-1)*vxaa)/i) + ((mxaa-currx).^2)/(i-1);
    end

    'training f2'
    mxf2 = trnc3(1,:);
    vxf2 = zeros(1,size(trnc3,2));
    for i = 2:size(trnc3,1)
      currx = trnc3(i,:);
      mxf2 = ((i-1)*mxf2+currx)/i;
      vxf2 = (((i-1)*vxf2)/i) + ((mxf2-currx).^2)/(i-1);
    end

    'training fa'
    mxfa = trnc4(1,:);
    vxfa = zeros(1,size(trnc4,2));
    for i = 2:size(trnc4,1)
      currx = trnc4(i,:);
      mxfa = ((i-1)*mxfa+currx)/i;
      vxfa = (((i-1)*vxfa)/i) + ((mxfa-currx).^2)/(i-1);
    end

    % Calculate bias terms:
    b9a = sum(log(vx9a));
    baa = sum(log(vxaa));
    bf2 = sum(log(vxf2));
    bfa = sum(log(vxfa));

  end %if train


  c1test = tst9a256;
  c2test = tstaa256;
  c3test = tstf2256;
  c4test = tstfa256;

  % Perform feature selection on the test sets
  c1test = tst9a256(:,featselect);
  c2test = tstaa256(:,featselect);
  c3test = tstf2256(:,featselect);
  c4test = tstfa256(:,featselect);

  % In all of the following ''dxy'' implies one is comparing
  % the xth class mean with an exemplar from class y. Note that
  % 1-4 corresponds to 9a-fa as shown at the beginning of the EDBFM
  % analysis program

  % 'testing 9a'
  for i = 1:size(c1test,1)
    [i size(c1test,1)];

    xt = c1test(i,:);

    % Give this subroutine the current exemplar,
    % the mean for the respective class, the
    % variance for the respective class, the
    % bias for the respective class, and the
    % number of features used.
    d11 = decisionwshiftkk(xt,mx9a,vx9a,b9a,kk);
    d21 = decisionwshiftkk(xt,mxaa,vxaa,baa,kk);
    d31 = decisionwshiftkk(xt,mxf2,vxf2,bf2,kk);
    d41 = decisionwshiftkk(xt,mxfa,vxfa,bfa,kk);

    % Now classify the data, saving relevant information for later
    [dum,idx] = min([d11 d21 d31 d41]);
    if idx == 1
      correct1 = correct1 + 1;
    end

    correct1;
  end


  % 'testing aa'
  for i = 1:size(c2test,1)
    [i size(c2test,1)];

    xt = c2test(i,:);

    d11 = decisionwshiftkk(xt,mx9a,vx9a,b9a,kk);
    d21 = decisionwshiftkk(xt,mxaa,vxaa,baa,kk);
    d31 = decisionwshiftkk(xt,mxf2,vxf2,bf2,kk);
    d41 = decisionwshiftkk(xt,mxfa,vxfa,bfa,kk);

    % Now classify the data, saving relevant information for later
    [dum,idx] = min([d11 d21 d31 d41]);
    if idx == 2
      correct2 = correct2 + 1;
    end

    correct2;
  end

  % 'testing f2'
  for i = 1:size(c3test,1)
    [i size(c3test,1)];

    xt = c3test(i,:);

    d11 = decisionwshiftkk(xt,mx9a,vx9a,b9a,kk);
    d21 = decisionwshiftkk(xt,mxaa,vxaa,baa,kk);
    d31 = decisionwshiftkk(xt,mxf2,vxf2,bf2,kk);
    d41 = decisionwshiftkk(xt,mxfa,vxfa,bfa,kk);

    % Now classify the data, saving relevant information for later
    [dum,idx] = min([d11 d21 d31 d41]);
    if idx == 3
      correct3 = correct3 + 1;
    end

    correct3;
  end

  % 'testing fa'
  for i = 1:size(c4test,1)
    [i size(c4test,1)];

    xt = c4test(i,:);

    d11 = decisionwshiftkk(xt,mx9a,vx9a,b9a,kk);
    d21 = decisionwshiftkk(xt,mxaa,vxaa,baa,kk);
    d31 = decisionwshiftkk(xt,mxf2,vxf2,bf2,kk);
    d41 = decisionwshiftkk(xt,mxfa,vxfa,bfa,kk);

    % Now classify the data, saving relevant information for later
    [dum,idx] = min([d11 d21 d31 d41]);
    if idx == 4
      correct4 = correct4 + 1;
    end

    correct4;
  end

  results(aa,1) = correct1;
  results(aa,2) = correct2;
  results(aa,3) = correct3;
  results(aa,4) = correct4;

  aa = aa+1;

  if aa == 1
    results(1,:)
  end

end %for kk

save orig4rateswshifts results
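
The feature selection step in the listing above picks the kk original features that carry the largest average absolute weight in the leading eigenvectors. A minimal sketch of the same ranking, written with a single sort instead of a repeated max, is given below. The matrix V here is a small made-up example, not the EDBFM result, and numvecs stands in for the ten eigenvectors averaged in the listing.

```matlab
% Sketch: rank original features by their average absolute weight in the
% leading eigenvectors. V is a small made-up matrix, not the EDBFM result.
V = [0.1 0.9; -0.7 0.2; 0.3 -0.4; 0.0 0.05];  % columns ordered by increasing eigenvalue
numvecs = 2;     % eigenvectors to average (the listing uses ten)
kk = 2;          % number of features to keep

useus = mean(abs(V(:, end-numvecs+1:end)), 2)';   % mean |weight| per feature
[sorted, order] = sort(useus, 'descend');
featselect = order(1:kk);                         % indices of the kk largest weights
```

Both forms should select the same indices when the weights are distinct; the sort form simply avoids overwriting entries of useus with a sentinel value.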
C.2.4 Subroutines.

% CHITEST SUBROUTINE
% This subroutine performs the intraclass chitest
% The "col" variable corresponds to the appropriate
% class being compared against

function [goodonly] = chitest(x,chialpha,col)
%x = d9a;
%chialpha = chialpha;
%col = 1;

goodonly = [];
%size(x)

for i = 1:size(x,1)
  if x(i,col) < chialpha
    goodonly = [goodonly; i];
  end
end

% LTEST SUBROUTINE
% Same as above, but Lmin ensures at least a minimum
% number of exemplars are used from the other classes

function [goodonly] = Ltest(x,chialpha,col,Lmi)

goodonly = [];
wx = size(x,1);
for i = 1:size(x,1)
  [i,wx];
  if x(i,col) < chialpha
    goodonly = [goodonly; i];
  end
end

chk = size(goodonly,1);

% If too few exemplars pass, keep the Lmi closest ones in column col
if chk < Lmi
  [dum wch] = sort(x(:,col));
  goodonly = wch(1:Lmi);
end %if

% FINDNEAREST SUBROUTINE
% This subroutine finds the nearest in the sense of Mahalanobis distance,
% as prescribed by Lee and Landgrebe

function [winner] = findnearest(reference,testsigs,var)

[numsigs,feats] = size(testsigs);
bestmatch = 0;

dists = zeros(numsigs,1);

var=var;

for i = 1:numsigs
  i;
  xtrial = testsigs(i,:);
  [lx,wx] = size(xtrial);
  [lref,wref] = size(reference);

  dists(i) = (var.^(-1)).*(xtrial-reference)*(xtrial-reference)';
end

[Y,maxind] = min(dists);
winner = testsigs(maxind,:);

% COMPUTENP SUBROUTINE
% This subroutine implements Lee's equations
% The difference between this and the test function
% is that one assumes uncorrelated features (variances
% are in a vector)

function [N,P] = computeNP(x1,m1,v1,x2,m2,v2,det1,det2,inv1,inv2,prob1,prob2,orient)

x1;x2;m1;m2;v1;v2;
m1 = m1';m2 = m2';
v1 = v1';v2 = v2';
x1 = x1';x2 = x2';
v0 = x1;
v = x2-x1;

sm1=size(m1);
sinv1=size(inv1);
sm2=size(m2);
sinv2=size(inv2);

c = .5 * (m1'*inv1*m1 - m2'*inv2*m2) + .5*log(det1/det2);
cprime = .5*v0' * (inv1-inv2) * v0 - (m1'*inv1-m2'*inv2)*v0 + c;

b = v0'*(inv1-inv2)*v - (m1'*inv1-m2'*inv2)*v;
a = .5 * v' * (inv1-inv2) * v;

t = log(prob1/prob2);

% Note: this step finds the appropriate root when one
% has two to choose from
if a == 0
  u = (t-cprime)/b;
else
  u1 = (-b + sqrt(b^2-4*a*(cprime-t)))/(2*a);
  u2 = (-b - sqrt(b^2-4*a*(cprime-t)))/(2*a);
  if orient==1
    u = min([u1 u2]);
  else
    u = max([u1 u2]);
  end
end

% Point on the decision boundary
p = u*v+v0;

% Gradient of h at p, normal to the decision boundary
N = (inv1-inv2)*p + (inv2*m2 - inv1*m1);

P=p;

N = N/sqrt(N'*N);

I  DECISIONWSHIFTSKK  SUBROUTINE 
I  This  subroutine  Implements  the  classification 
I  scheme  like  that  of  the  AGC.  In  other  words, 

\  exemplars  are  shifted  up  to  18  bins  in  either 
%  direction  and  matched  to  each  template  and  then 
%  a  winner  is  declared  in  the  main  program. 

)  This  subroutine  just  does  the  correlation/shifting 
t  portion,  kk  represents  how  many  features  are 
4  being  used.  In  this  particular  version  of  the 
t  subroutine  a  single  number  is  returned,  but  the  program 
\  may  be  easily  modified  to  return  the  shifted  signature 
4  if  necessary 

44444444444444444444444444444444444444444444444444444444444 

functlon(di8t]  •  deci8ionw8hift(x, template, var,b,kk) 

4  This  routine  performs  a  circular  convolution 
4  on  a  set  of  signals  vs  a  template 

lbound = -18;
ubound = 18;

if abs(lbound) > kk
    lbound = -kk+2;
    ubound = kk-2;
end

if kk == 2
    lbound = 0;
    ubound = 0;
end

if kk == 1
    lbound = 0;
    ubound = 0;
end

[wx,lx] = size(x);
xtrial = zeros(1,lx);

alignedsig = zeros(1,lx);
shiftstore = zeros(wx,1);

reference = template;

%reference = reference/sqrt(reference*reference');
%reference(1:5)


% Find offsets via circular shifting
xtemp = x;
% xtemp = xtemp/sqrt(xtemp*xtemp');

%xtemp(1:5)

mdist = zeros(1,ubound-lbound+1);
bestshift = 0;

for tryshift = lbound:ubound
    aa = tryshift-lbound+1;

    if tryshift < 0
        xtrial(aa,:) = [xtemp(-tryshift+1:lx) xtemp(1:-tryshift)];
    elseif tryshift == 0
        xtrial(aa,:) = xtemp;
    elseif tryshift > 0
        xtrial(aa,:) = [xtemp(lx-tryshift+1:lx) xtemp(1:lx-tryshift)];
    end

    mdist(aa) = .5*(var.^(-1)).*(xtrial(aa,:) - reference)*(xtrial(aa,:)-reference)' + .5*b;
end %for

[dist,bestshift] = min(mdist);
alignedsig = xtrial(bestshift,:);
bestshift = bestshift-1;
shiftstore = bestshift;